Add terraform for provisioning s390x build cluster on ibmcloud #8413
Conversation
Welcome @sudharshanibm!
Hi @sudharshanibm. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
167d452 to 2867a10 Compare
2867a10 to 1ef2ece Compare
  create_duration = "180s" # Wait 3 minutes for full initialization
}

resource "null_resource" "bastion_setup" {
this is not very declarative - is this truly required, or can the verification be run as a one-off by someone? do we expect things to not be set up correctly via Terraform (i.e. - verification here fails) periodically or at all? (if so - that seems problematic, and who or what is intended to fix that 😅 )
The bastion instance’s initial setup and networking configuration are handled through a cloud-init script in user_data. Runtime verification of the bastion’s internal state is treated as an operational concern, and can be performed outside of Terraform via automation/manual checks.
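A rough sketch of that pattern (the resource name, profile, and cloud-init file path are assumptions for illustration, not the PR's actual code):

```hcl
# Hypothetical sketch: the bastion's first-boot setup lives in a cloud-init
# script passed via user_data, so no post-provisioning verification is
# needed inside Terraform itself.
resource "ibm_is_instance" "bastion" {
  name    = "k8s-s390x-bastion"
  vpc     = data.ibm_is_vpc.vpc.id
  zone    = var.zone
  profile = "bz2-2x8" # assumed profile
  image   = data.ibm_is_image.os_image.id

  primary_network_interface {
    subnet = data.ibm_is_subnet.subnet.id
  }

  # cloud-init handles packages, users, and networking at first boot;
  # runtime health checks stay outside Terraform as operational concerns.
  user_data = file("${path.module}/cloud-init/bastion.yaml")
}
```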
  vpc = data.ibm_is_vpc.vpc.id
}

data "ibm_is_security_group" "master_sg" {
can we use an improved alternate terminology? main, primary, control-plane?
updated all occurrences of master to control-plane
@@ -0,0 +1,129 @@
/*
the use of k8s-* throughout the file names, resource names, etc. seems redundant (i.e. - it's defined in a K8s repo/org, so it's assumed to be for K8s; no need to re-state that in the names)
Hi @bryantbiggs, thanks for pointing that out.
I followed the same naming convention since the existing s390x setup is already connected to ArgoCD with k8s-* prefixed names (ref: #8332).
See the License for the specific language governing permissions and
limitations under the License.
*/
resource "ibm_is_lb" "k8s_load_balancer" {
the names should be viewed as a hierarchy level just below the resource type, but read together. so ibm_is_lb.k8s_load_balancer is essentially "IBM load balancer - Kubernetes load balancer"
what about something shorter and more descriptive, like public (ibm_is_lb.public)?
Updated k8s_load_balancer to public
limitations under the License.
*/
resource "ibm_is_lb" "k8s_load_balancer" {
  name = "k8s-s390x-lb"
again, the name is very redundant. It's in a K8s repo; is this infrastructure deployed in an account that hosts resources for things outside of K8s? If yes = fine, keep the k8s; if not, drop it. Also, the resource is a load balancer, so no need to add that to the name (my name is Bryant Biggs, not Bryant Biggs human 😅 ). Lastly, I assume this is only for testing on s390x platforms, so is that adding anything to the name? Maybe it is, maybe it isn't - I don't have enough context to say. But what about something more descriptive as to its intent and aligned with resources already defined in the repo
kops-infra-ci-name = "kops-infra-ci"
what about s390x-ci? or if you desire the full name k8s-s390x-ci?
Note: this same thinking above applies throughout the change set
updated k8s-s390x-lb to k8s-s390x-ci
health_timeout      = 2
health_type         = "tcp"
health_monitor_url  = "/"
health_monitor_port = 6443
6443 appears in a number of places - good chance it should be a local variable to avoid mis-matches
refactored to use a variable api_server_port (default 6443)
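A minimal sketch of that refactor (the variable name comes from the comment above; the pool resource and its other attributes are illustrative assumptions):

```hcl
variable "api_server_port" {
  description = "Port the Kubernetes API server listens on"
  type        = number
  default     = 6443
}

# Every place that previously hard-coded 6443 now references the variable,
# so the listener, pool, and health check cannot drift out of sync.
resource "ibm_is_lb_pool" "api" {
  name                = "k8s-s390x-api"
  lb                  = ibm_is_lb.public.id
  algorithm           = "round_robin"
  protocol            = "tcp"
  health_delay        = 5
  health_retries      = 2
  health_timeout      = 2
  health_type         = "tcp"
  health_monitor_port = var.api_server_port
}
```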
  name = "k8s-s390x-subnet"
}

data "ibm_is_security_group" "bastion_sg" {
- data "ibm_is_security_group" "bastion_sg" {
+ data "ibm_is_security_group" "bastion" {
Security group is already in the resource type - don't repeat in the resource name
I already have resource ibm_is_instance bastion defined in bastion.tf, so reusing bastion here would cause a naming conflict
no - that's a different resource from this data source, so no conflict
renamed bastion_sg --> bastion
}

resource "ibm_is_instance" "control_plane" {
  count = var.control_plane.count
this should be a for_each so that we avoid amplifying disruptions
updated the control-plane to use for_each in nodes.tf
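A hedged sketch of what that for_each change can look like (the local node map and instance attributes are assumptions about the final shape, not the PR's exact code):

```hcl
locals {
  # A keyed map instead of a count: removing "cp-1" no longer shifts the
  # index of "cp-2", so Terraform won't destroy and recreate the survivors.
  control_plane_nodes = {
    for i in range(var.control_plane_count) :
    "cp-${i}" => { profile = "bz2-8x32" }
  }
}

resource "ibm_is_instance" "control_plane" {
  for_each = local.control_plane_nodes

  name    = "k8s-s390x-${each.key}"
  profile = each.value.profile
  vpc     = data.ibm_is_vpc.vpc.id
  zone    = var.zone
  image   = data.ibm_is_image.os_image.id

  primary_network_interface {
    subnet = data.ibm_is_subnet.subnet.id
  }
}
```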
  default = "eu-de-1"
}

variable "resource_group_name" {
should this be a variable?
Removed the variable, Thanks!
  }
}

variable "bastion" {
should this be a variable?
bastion now generated in locals
  }
}

variable "control_plane" {
should this be a variable?
control-plane now generated in locals
  }
}

variable "compute" {
should this be a variable?
compute now generated in locals
@@ -0,0 +1,136 @@
/*
variables are mostly used in modules - I would argue, are these required? Should they be local variables instead (if they are used in multiple places but not intended to be configured dynamically outside of whats stored in git)?
In this case, these values are intended to be configurable by automation (Ansible playbooks) rather than hard-coded in the repo
ansible to deploy Terraform - why?
Thanks for circling back on this. I realize my earlier explanation may have caused some confusion. What I meant is that while we do have downstream automation that consumes certain outputs for cluster setup, that doesn't mean all of the Terraform configuration itself needs to be exposed as variables.
After revisiting your feedback, I've updated the code so that only true external inputs like API key, region, secrets manager ID, counts, profiles, etc., remain as variables. The structured definitions for bastion, control-plane, and compute nodes are now generated in locals, which keeps the interface cleaner and avoids over-exposing internal implementation details.
I still don't grok why most of these variables need to exist (ignoring the API key for now). for example, we can't simply change the region, nor do I believe we intended on doing so. if we need to set the name of the region in multiple locations, that's what a local variable is good for.
variables are public inputs for a public interface
local variables are a private implementation detail
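That distinction can be sketched as follows (the specific values are taken from elsewhere in this PR; the split itself is illustrative):

```hcl
# Public interface: a caller is genuinely expected to supply this.
variable "ibmcloud_api_key" {
  description = "IBM Cloud API key used by the provider"
  type        = string
  sensitive   = true
}

# Private implementation details: fixed for this deployment, referenced
# in several places, but never meant to be overridden from outside.
locals {
  region = "eu-de"
  zone   = "eu-de-1"
  name   = "k8s-s390x"
}
```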
}

resource "ibm_is_instance" "compute" {
  count = var.compute.count
likewise, use for_each
updated the compute to use for_each in nodes.tf
  }
}

resource "null_resource" "node_setup" {
this doesn't appear to be doing anything - can it be removed?
@@ -0,0 +1,43 @@
/*
outputs are intended to be consumed elsewhere - are all of these intended to be consumed somewhere else?
Yes, these outputs are consumed in our Ansible automation for installing Kubernetes. Specifically, the IPs and load balancer hostname are passed into the Ansible playbook (ansible-playbook -v -i examples/k8s-build-cluster/hosts.yml install-k8s-ha.yaml -e @group_vars/bastion_configuration --extra-vars @group_vars/all) to configure the bastion, control-plane, and worker nodes.
hmm, I don't think I fully follow this. so the plan would be to write these outputs to a file, or? how do you bridge the gap between an output in Terraform and said output landing in a file on disk that Ansible then uses? Is that Ansible definition going to live in this repo as well? Should we store these values somewhere that Ansible can readily retrieve - S3/SSM/etc?
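One way to bridge that gap (a sketch of one option, not what the PR implements; resource names and attribute paths are illustrative) is to have Terraform render the inventory file directly:

```hcl
# Hypothetical bridge between Terraform and Ansible: render the node
# addresses straight into the hosts file the playbook consumes, instead
# of copying outputs by hand.
resource "local_file" "ansible_inventory" {
  filename = "${path.module}/hosts.yml"
  content = yamlencode({
    all = {
      hosts = {
        bastion = { ansible_host = ibm_is_floating_ip.bastion.address }
      }
      children = {
        control_plane = {
          hosts = {
            for k, v in ibm_is_instance.control_plane :
            # attribute path is illustrative and depends on provider version
            k => { ansible_host = v.primary_network_interface[0].primary_ipv4_address }
          }
        }
      }
    }
  })
}
```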
  zone = "eu-de-1"
}

provider "ibm" {
where is the min required versions of Terraform and the providers defined?
Defined in the versions.tf file, which specifies both the Terraform version constraint and the provider versions.
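A versions.tf for this kind of layout would typically look something like the following (the version numbers are illustrative assumptions, not the PR's pinned values):

```hcl
terraform {
  # Minimum Terraform version this configuration is tested against.
  required_version = ">= 1.5.0"

  required_providers {
    ibm = {
      source  = "IBM-Cloud/ibm"
      version = ">= 1.60.0" # assumed constraint
    }
  }
}
```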
limitations under the License.
*/
resource "ibm_is_vpc" "vpc" {
  name = "k8s-s390x-vpc"
drop the redundant resource name from the naming scheme throughout
- remove -vpc here
# VPC
resource "ibm_is_public_gateway" "public_gw" {
  name = "k8s-s390x-public-gw"
- name = "k8s-s390x-public-gw"
+ name = "k8s-s390x"
# Subnet
resource "ibm_is_subnet" "subnet" {
  name = "k8s-s390x-subnet"
most likely you can create a local variable of

locals {
  name = "k8s-s390x"
}

and for all these resource names, just simply update to use

name = local.name

Don't add the -gw or -sg or -vpc, etc.
# Security Groups
resource "ibm_is_security_group" "bastion_sg" {
  name = "k8s-vpc-s390x-bastion-sg"
this would become

- name = "k8s-vpc-s390x-bastion-sg"
+ name = "${local.name}-bastion"
@@ -0,0 +1 @@
ibmcloud_api_key = "<YOUR_API_KEY>"
FYI - I don't have approval authority on changes such as these in this repo; these are just my suggestions based on working with Terraform in various capacities over the years. the sig-k8s-infra-leads are the final approving authority
369d8da to dd30419 Compare
Thank you so much for your suggestions and a comprehensive review, @bryantbiggs 🙏
@upodroid, could you kindly advise whether we should keep the current structure for consistency or apply the suggested changes?
locals {
  key    = var.ibmcloud_api_key
  region = "eu-de"
  zone   = "eu-de-1"
Since you’ve defined a variable for the zone, shouldn’t this value come from it?
looks okay to me. just a suggestion: check if we can make the common syntax-rules creation generic to avoid repetition, ref - (https://github.com/kubernetes/k8s.io/blob/main/infra/ibmcloud/terraform/k8s-infra-setup/modules/vpc/vpc.tf#L30-L42)
I’m fine with the current approach if that’s what you’re comfortable with.
  }
}

variable "control_plane_node_count" {
(just a nit) Is it possible to have something like this -

variable "control_plane" {
  default = {
    count            = 5
    profile          = "bz2-8x32"
    boot_volume_size = 100
  }
}

similar comment for compute, bastion
maybe we can avoid this for now, if you are comfortable with present design from maintainability POV.
variable "bastion_private_ip" {
  description = "Private IP address for the bastion's secondary interface"
  type        = string
  default     = "192.168.100.10"
why this default value? (Is this always going to be same unless specified in case of new build cluster creation)
Thanks for pointing that out.
The default value 192.168.100.10 is used for consistency across our internal build environments; it serves as the predefined private IP for the bastion's secondary interface during initial cluster setups.
However, it's not hard-coded in the sense that it must always remain the same; we can override it via terraform.tfvars if a different subnet or IP is required for a new build cluster.
variable "image_id" {
  type        = string
  description = "Image ID for instances"
  default     = "r010-bdd9c78f-4a2f-441f-a375-2ed288dcad15"
why this default value? (Is this always going to be the same across multiple regions? If not, we should try using the name, if the name is the same across regions)
Updated the default from a region-specific image_id to the image name (ibm-ubuntu-24-04-3-minimal-s390x-2) so it’s consistent across regions
skip_credentials_validation = true
endpoints = {
  s3 = "https://s3.eu.cloud-object-storage.appdomain.cloud"
}
did you miss adding placeholders for access_key and secret_key here?
Thanks @Prajyot-Parab.
I had removed the access_key and secret_key placeholders based on feedback from @bryantbiggs in an earlier review (ref: PR comment).
We've shifted to using HMAC credentials via environment variables instead of hard-coding keys in versions.tf. Terraform automatically picks these up when set like this:
export AWS_ACCESS_KEY_ID="<HMAC_ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<HMAC_SECRET_ACCESS_KEY>"
README has also been updated with these instructions to ensure credentials are managed securely outside the code
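With that scheme, the S3-compatible backend stanza stays free of credentials entirely; a sketch, with the bucket name and state key as assumptions:

```hcl
terraform {
  backend "s3" {
    # IBM COS speaks the S3 API; credentials come from AWS_ACCESS_KEY_ID /
    # AWS_SECRET_ACCESS_KEY (the HMAC keys), never from this file.
    bucket                      = "k8s-s390x-tfstate"               # assumed
    key                         = "build-cluster/terraform.tfstate" # assumed
    region                      = "eu-de"
    skip_credentials_validation = true
    skip_region_validation      = true
    endpoints = {
      s3 = "https://s3.eu.cloud-object-storage.appdomain.cloud"
    }
  }
}
```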
provider "ibm" {
  alias            = "vpc"
  ibmcloud_api_key = local.key
  region           = "eu-de"
- region = "eu-de"
+ region = local.region
f6df2ce to 7ea49d9 Compare
region = "eu-de"
zone   = "eu-de-1"
Since you’ve defined a variable for the zone & region, shouldn’t these values come from there?
7ea49d9 to 932dd09 Compare
vpc     = data.ibm_is_vpc.vpc.id
zone    = var.zone
profile = each.value.profile
image   = var.image_id
You need to fetch the image ID from an image data source using the name and pass it here; image expects only the ID of the image.
@Prajyot-Parab Thanks for catching this! You're right.
I've updated the code to use a data source for fetching the image ID from the image name.
Changes:
- Added data "ibm_is_image" "os_image" for image lookup
- Updated all instance resources to use data.ibm_is_image.os_image.id
- Renamed variable to image_name with updated default value
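That change can be sketched roughly as follows (the variable default is taken from the reply above; the surrounding instance attributes and the local node map are assumptions):

```hcl
variable "image_name" {
  type        = string
  description = "OS image name, resolved to a region-local ID at plan time"
  default     = "ibm-ubuntu-24-04-3-minimal-s390x-2"
}

# Look up the region-specific image ID by name, since ibm_is_instance.image
# accepts only an ID and IDs differ between regions.
data "ibm_is_image" "os_image" {
  name = var.image_name
}

resource "ibm_is_instance" "compute" {
  for_each = local.compute_nodes # assumed map of node definitions

  name    = "k8s-s390x-${each.key}"
  image   = data.ibm_is_image.os_image.id
  profile = each.value.profile
  vpc     = data.ibm_is_vpc.vpc.id
  zone    = var.zone

  primary_network_interface {
    subnet = data.ibm_is_subnet.subnet.id
  }
}
```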
932dd09 to 289f9fd Compare
Signed-off-by: Sudharshan Muralidharan <sudharshan.muralidharan1@ibm.com>
289f9fd to e8bd0cd Compare
Prajyot-Parab left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Prajyot-Parab, sudharshanibm
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment