Conversation

@sudharshanibm sudharshanibm commented Aug 18, 2025

Add terraform for provisioning s390x build cluster on ibmcloud

  • k8s-s390x-infra-setup: Terraform resources to set up the necessary infrastructure in an IBM Cloud account, which will serve as prerequisites for the k8s-s390x-build-cluster automation.
  • k8s-s390x-build-cluster: Terraform resources to provision the s390x Build Cluster infrastructure in IBM Cloud.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 18, 2025
@k8s-ci-robot

Welcome @sudharshanibm!

It looks like this is your first PR to kubernetes/k8s.io 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/k8s.io has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot

Hi @sudharshanibm. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/infra Infrastructure management, infrastructure design, code in infra/ labels Aug 18, 2025
@k8s-ci-robot k8s-ci-robot added the sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. label Aug 18, 2025
@sudharshanibm force-pushed the k8s-s390x-infra branch 2 times, most recently from 167d452 to 2867a10 on September 1, 2025 05:14
```hcl
  create_duration = "180s" # Wait 3 minutes for full initialization
}

resource "null_resource" "bastion_setup" {
```
@bryantbiggs commented Sep 13, 2025:

this is not very declarative - is this truly required, or can the verification be run as a one-off by someone? do we expect things to not be set up correctly via Terraform (i.e. - verification here fails) periodically or at all? (if so - that seems problematic, and who or what is intended to fix that 😅 )

Author reply:

The bastion instance’s initial setup and networking configuration are handled through a cloud-init script in user_data. Runtime verification of the bastion’s internal state is treated as an operational concern, and can be performed outside of Terraform via automation/manual checks.
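A minimal sketch of the pattern the author describes — bootstrapping the bastion through `user_data` rather than a provisioner. The cloud-init file path and the surrounding instance arguments are illustrative, not the exact code from this PR:

```hcl
# Sketch: bastion bootstrap via cloud-init passed in user_data.
# File name, variable names, and attributes are assumptions.
resource "ibm_is_instance" "bastion" {
  name      = "bastion"
  vpc       = data.ibm_is_vpc.vpc.id
  zone      = var.zone
  profile   = var.bastion_profile
  image     = var.image_id
  user_data = file("${path.module}/cloud-init/bastion.yaml")

  primary_network_interface {
    subnet = data.ibm_is_subnet.subnet.id
  }
}
```

With this shape, the instance converges on first boot and Terraform stays declarative; any later runtime verification happens outside Terraform, as the author notes.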

```hcl
  vpc = data.ibm_is_vpc.vpc.id
}

data "ibm_is_security_group" "master_sg" {
```
Reviewer:

can we use an improved alternate terminology? main, primary, control-plane?

Author reply:

updated all occurrences of master to control-plane

```diff
@@ -0,0 +1,129 @@
/*
```
Reviewer:

the use of k8s-* throughout the file names, resource names, etc. seems redundant (i.e. - it's defined in a K8s repo/org, so it's assumed to be for K8s; no need to re-state that in the names)

Author reply:

Hi @bryantbiggs, thanks for pointing that out.
I followed the same naming convention since the existing s390x setup is already connected to ArgoCD with k8s-* prefixed names (ref: #8332).

```hcl
See the License for the specific language governing permissions and
limitations under the License.
*/
resource "ibm_is_lb" "k8s_load_balancer" {
```
Reviewer:

the names should be viewed as a hierarchy level just below the resource type, but read together. so `ibm_is_lb.k8s_load_balancer` is essentially "IBM load balancer - Kubernetes load balancer"

what about something shorter and more descriptive like `public` (`ibm_is_lb.public`)?

Author reply:

Updated k8s_load_balancer to public

```hcl
limitations under the License.
*/
resource "ibm_is_lb" "k8s_load_balancer" {
  name = "k8s-s390x-lb"
```
@bryantbiggs commented Sep 13, 2025:

again, the name is very redundant. It's in a K8s repo - is this infrastructure deployed in an account that hosts resources for things outside of K8s? If yes, fine, keep the k8s; if not, drop it. Also, the resource is a load balancer, so no need to add that to the name (my name is Bryant Biggs, not Bryant Biggs human 😅 ). Lastly, I assume this is only for testing on s390x platforms, so is that adding anything to the name? Maybe it is, maybe it isn't - I don't have enough context to say. But what about something more descriptive as to its intent and aligned with resources already defined in the repo:

```hcl
kops-infra-ci-name = "kops-infra-ci"
```

what about s390x-ci? or if you desire the full name k8s-s390x-ci?

Reviewer:

Note: this same thinking above applies throughout the change set

Author reply:

updated k8s-s390x-lb to k8s-s390x-ci

```hcl
health_timeout      = 2
health_type         = "tcp"
health_monitor_url  = "/"
health_monitor_port = 6443
```
Reviewer:

6443 appears in a number of places - good chance it should be a local variable to avoid mismatches

Author reply:

refactored to use a variable api_server_port (default 6443)
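A minimal sketch of that refactor — the variable name comes from the reply, while the pool stanza it feeds is illustrative:

```hcl
variable "api_server_port" {
  description = "Port the Kubernetes API server listens on"
  type        = number
  default     = 6443
}

# Referenced wherever the literal 6443 previously appeared, e.g.:
resource "ibm_is_lb_pool" "api" {
  # ...other arguments elided...
  health_monitor_port = var.api_server_port
}
```

Centralizing the port this way means a future change (or a non-default API port) is a one-line edit rather than a hunt across files.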

```hcl
  name = "k8s-s390x-subnet"
}

data "ibm_is_security_group" "bastion_sg" {
```
Reviewer:

Suggested change:

```diff
-data "ibm_is_security_group" "bastion_sg" {
+data "ibm_is_security_group" "bastion" {
```

Security group is already in the resource type - don't repeat it in the resource name

Author reply:

I already have `resource "ibm_is_instance" "bastion"` defined in bastion.tf, so reusing bastion here would cause a naming conflict

Reviewer:

no - that's a different resource from this data source, so no conflict

Author reply:

renamed bastion_sg --> bastion

```hcl
}

resource "ibm_is_instance" "control_plane" {
  count = var.control_plane.count
```
Reviewer:

this should be a for_each so that we avoid amplifying disruptions

Author reply:

updated the control-plane to use for_each in nodes.tf
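A sketch of what that count→for_each conversion can look like — the map keys and locals here are assumptions, not the exact code in nodes.tf:

```hcl
locals {
  # Stable string keys: removing one node no longer re-indexes the rest,
  # which is the disruption-amplification problem count has.
  control_plane_nodes = {
    for i in range(var.control_plane_node_count) :
    "control-plane-${i + 1}" => { profile = "bz2-8x32" } # profile assumed
  }
}

resource "ibm_is_instance" "control_plane" {
  for_each = local.control_plane_nodes

  name    = each.key
  profile = each.value.profile
  # ...remaining arguments elided...
}
```

The key point is that `for_each` addresses instances by name in state, so deleting `control-plane-2` touches only that instance, whereas with `count` every higher-indexed instance would be destroyed and recreated.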

```hcl
  default = "eu-de-1"
}

variable "resource_group_name" {
```
Reviewer:

should this be a variable?

Author reply:

Removed the variable, thanks!

```hcl
  }
}

variable "bastion" {
```
Reviewer:

should this be a variable?

Author reply:

bastion now generated in locals

```hcl
  }
}

variable "control_plane" {
```
Reviewer:

should this be a variable?

Author reply:

control-plane now generated in locals

```hcl
  }
}

variable "compute" {
```
Reviewer:

should this be a variable?

Author reply:

compute now generated in locals

```diff
@@ -0,0 +1,136 @@
/*
```
Reviewer:

variables are mostly used in modules - I would argue, are these required? Should they be local variables instead (if they are used in multiple places but not intended to be configured dynamically outside of what's stored in git)?

Author reply:

In this case, these values are intended to be configurable by automation (Ansible playbooks) rather than hard-coded in the repo

Reviewer:

ansible to deploy Terraform - why?

Author reply:

Thanks for circling back on this. I realize my earlier explanation may have caused some confusion. What I meant is that while we do have downstream automation that consumes certain outputs for cluster setup, that doesn’t mean all of the Terraform configuration itself needs to be exposed as variables.

After revisiting your feedback, I’ve updated the code so that only true external inputs like API key, region, secrets manager ID, counts, profiles, etc., remain as variables. The structured definitions for bastion, control-plane, and compute nodes are now generated in locals, which keeps the interface cleaner and avoids over-exposing internal implementation details
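A sketch of the shape the author describes — true external inputs stay as variables, while the structured node definitions move into locals (the names and values here are illustrative):

```hcl
# Public interface: only what automation genuinely needs to override.
variable "control_plane_node_count" {
  type    = number
  default = 3
}

# Private implementation detail: the structured definition lives in locals.
locals {
  control_plane = {
    count            = var.control_plane_node_count
    profile          = "bz2-8x32" # assumed instance profile
    boot_volume_size = 100        # assumed size in GB
  }
}
```

This keeps the module's surface area small: callers can tune the count, but cannot silently reshape internals that were never meant to vary.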

Reviewer:

I still don't grok why most of these variables need to exist (ignoring the API key for now). for example, we can't simply change the region, nor do I believe we intended on doing so. if we need to set the name of the region in multiple locations, that's what a local variable is good for.

variables are public inputs for a public interface;
local variables are a private implementation detail

```hcl
}

resource "ibm_is_instance" "compute" {
  count = var.compute.count
```
Reviewer:

likewise, use for_each

Author reply:

updated the compute to use for_each in nodes.tf

```hcl
  }
}

resource "null_resource" "node_setup" {
```
@bryantbiggs commented Sep 13, 2025:

this doesn't appear to be doing anything - can it be removed?

```diff
@@ -0,0 +1,43 @@
/*
```
Reviewer:

outputs are intended to be consumed elsewhere - are all of these intended to be consumed somewhere else?

Author reply:

Yes, these outputs are consumed in our Ansible automation for installing Kubernetes. Specifically, the IPs and load balancer hostname are passed into the Ansible playbook (`ansible-playbook -v -i examples/k8s-build-cluster/hosts.yml install-k8s-ha.yaml -e @group_vars/bastion_configuration --extra-vars @group_vars/all`) to configure the bastion, control-plane, and worker nodes.

Reviewer:

hmm, I don't fully follow this. so the plan would be to write these outputs to file, or? how do you bridge the gap between an output in Terraform and said output landing in a file on disk that ansible then uses? Is that ansible definition going to live in this repo as well? Should we store these values somewhere that Ansible can readily retrieve - S3/SSM/etc?
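One common way to bridge that gap — a sketch, not necessarily this PR's approach, with illustrative output and resource names — is to render the outputs straight into a file Ansible can read, using the `local_file` resource from the hashicorp/local provider:

```hcl
# Sketch: surface values both as outputs and as an on-disk file.
output "lb_hostname" {
  value = ibm_is_lb.public.hostname
}

resource "local_file" "ansible_group_vars" {
  filename = "${path.module}/group_vars/all"
  content = yamlencode({
    lb_hostname = ibm_is_lb.public.hostname
    node_ips    = local.node_ips # assumed local collecting instance IPs
  })
}
```

Alternatively, a wrapper script can parse `terraform output -json`, or the values can be pushed to object storage / a secrets manager for Ansible to retrieve, as the reviewer suggests.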

```hcl
  zone = "eu-de-1"
}

provider "ibm" {
```
@bryantbiggs commented Sep 13, 2025:

where is the min required versions of Terraform and the providers defined?

Author reply:

Defined in the versions.tf file, which specifies both the Terraform version constraint and the provider versions.
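A typical shape for such a versions.tf — the version numbers here are illustrative, not necessarily the ones pinned in this PR:

```hcl
terraform {
  required_version = ">= 1.5.0" # assumed minimum

  required_providers {
    ibm = {
      source  = "IBM-Cloud/ibm"
      version = ">= 1.60.0" # assumed constraint
    }
  }
}
```

Pinning both the CLI and provider versions in one file keeps `terraform init` reproducible across contributors and CI.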

```hcl
limitations under the License.
*/
resource "ibm_is_vpc" "vpc" {
  name = "k8s-s390x-vpc"
```
Reviewer:

drop the redundant resource name from the naming scheme throughout

  • remove -vpc here


```hcl
# VPC
resource "ibm_is_public_gateway" "public_gw" {
  name = "k8s-s390x-public-gw"
```
Reviewer:

Suggested change:

```diff
-  name = "k8s-s390x-public-gw"
+  name = "k8s-s390x"
```


```hcl
# Subnet
resource "ibm_is_subnet" "subnet" {
  name = "k8s-s390x-subnet"
```
Reviewer:

most likely you can create a local variable of

```hcl
locals {
  name = "k8s-s390x"
}
```

and for all these resource names, simply update to use

```hcl
  name = local.name
```

Don't add the -gw or -sg or -vpc, etc.


```hcl
# Security Groups
resource "ibm_is_security_group" "bastion_sg" {
  name = "k8s-vpc-s390x-bastion-sg"
```
Reviewer:

this would become:

```diff
-  name = "k8s-vpc-s390x-bastion-sg"
+  name = "${local.name}-bastion"
```

```diff
@@ -0,0 +1 @@
ibmcloud_api_key = "<YOUR_API_KEY>"
```

@bryantbiggs commented:

FYI - I don't have approval authority on changes such as these in this repo; these are just my suggestions based on working with Terraform in various capacities over the years

the sig-k8s-infra-leads are the final approving authority

@sudharshanibm (author) commented:

Thank you so much for your suggestions and a comprehensive review, @bryantbiggs 🙏
We have followed the same structure and naming pattern as used in the IBM Power architecture PR (#7889), which was reviewed and merged by @upodroid.
Since the s390x cluster is already live and provided to the community for testing, we wanted to check if it would be preferable to maintain the current naming for consistency across architectures, unless the sig-k8s-infra-leads recommend otherwise.

@upodroid, could you kindly advise whether we should keep the current structure for consistency or apply the suggested changes?

```hcl
locals {
  key    = var.ibmcloud_api_key
  region = "eu-de"
  zone   = "eu-de-1"
```
Reviewer:

Since you’ve defined a variable for the zone, shouldn’t this value come from it?

Reviewer:

Looks okay to me; just a suggestion - check if we can make the common rules creation generic to avoid repetition (ref: https://github.com/kubernetes/k8s.io/blob/main/infra/ibmcloud/terraform/k8s-infra-setup/modules/vpc/vpc.tf#L30-L42)

Reviewer:

I’m fine with the current approach if that’s what you’re comfortable with.

```hcl
  }
}

variable "control_plane_node_count" {
```
Reviewer:

(just a nit) Is it possible to have something like this -

```hcl
variable "control_plane" {
  default = {
    count            = 5
    profile          = "bz2-8x32"
    boot_volume_size = 100
  }
}
```

similar comment for compute, bastion

Reviewer:

maybe we can avoid this for now, if you are comfortable with the present design from a maintainability POV.

```hcl
variable "bastion_private_ip" {
  description = "Private IP address for the bastion's secondary interface"
  type        = string
  default     = "192.168.100.10"
```
Reviewer:

why this default value? (Is this always going to be the same, unless specified, in the case of new build cluster creation?)

Author reply:

Thanks for pointing that out.
The default value 192.168.100.10 is used for consistency across our internal build environments; it serves as the predefined private IP for the bastion's secondary interface during initial cluster setup.
However, it's not hard-coded in the sense that it must always remain the same - we can override it via terraform.tfvars if a different subnet or IP is required for a new build cluster.

```hcl
variable "image_id" {
  type        = string
  description = "Image ID for instances"
  default     = "r010-bdd9c78f-4a2f-441f-a375-2ed288dcad15"
```
Reviewer:

why this default value? (Is this always going to be the same across multiple regions? If not, we should try using the name, if that will be the same across regions.)

Author reply:

Updated the default from a region-specific image_id to the image name (ibm-ubuntu-24-04-3-minimal-s390x-2) so it’s consistent across regions

```hcl
  skip_credentials_validation = true
  endpoints = {
    s3 = "https://s3.eu.cloud-object-storage.appdomain.cloud"
  }
```
Reviewer:

did you miss adding placeholders for access_key and secret_key here?

Author reply:

Thanks @Prajyot-Parab.
I had removed the access_key and secret_key placeholders based on feedback from @bryantbiggs in an earlier review (ref: PR comment).

We've shifted to using HMAC credentials via environment variables instead of hard-coding keys in versions.tf. Terraform automatically picks these up when set like this:

```shell
export AWS_ACCESS_KEY_ID="<HMAC_ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<HMAC_SECRET_ACCESS_KEY>"
```

The README has also been updated with these instructions to ensure credentials are managed securely outside the code.
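For context, an S3-compatible Cloud Object Storage backend configured this way might look like the following sketch — only the endpoint URL comes from the PR; the bucket, key, and extra skip flags are assumptions:

```hcl
terraform {
  backend "s3" {
    bucket                      = "k8s-s390x-tfstate" # assumed bucket name
    key                         = "build-cluster/terraform.tfstate"
    region                      = "eu-de"
    skip_credentials_validation = true
    skip_region_validation      = true
    endpoints = {
      s3 = "https://s3.eu.cloud-object-storage.appdomain.cloud"
    }
    # Credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY,
    # so no keys appear in version control.
  }
}
```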

```hcl
provider "ibm" {
  alias            = "vpc"
  ibmcloud_api_key = local.key
  region           = "eu-de"
```
Reviewer:

Suggested change:

```diff
-  region = "eu-de"
+  region = local.region
```

@sudharshanibm force-pushed the k8s-s390x-infra branch 2 times, most recently from f6df2ce to 7ea49d9 on November 10, 2025 07:12
Comment on lines 18 to 19:

```hcl
region = "eu-de"
zone   = "eu-de-1"
```
Reviewer:

Since you’ve defined a variable for the zone & region, shouldn’t these values come from there?

```hcl
  vpc     = data.ibm_is_vpc.vpc.id
  zone    = var.zone
  profile = each.value.profile
  image   = var.image_id
```
Reviewer:

You need to fetch the image ID from an image data source using the name and pass it here; `image` expects only the ID of an image.

Author reply:

@Prajyot-Parab Thanks for catching this! You’re right
I’ve updated the code to use a data source for fetching the image ID from the image name

Changes:

  • Added data "ibm_is_image" "os_image" for image lookup
  • Updated all instance resources to use data.ibm_is_image.os_image.id
  • Renamed variable to image_name with updated default value
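The described change, sketched below — the data-source and variable names come from the reply itself, while the surrounding arguments are assumptions:

```hcl
variable "image_name" {
  type        = string
  description = "OS image name (region-independent, unlike an image ID)"
  default     = "ibm-ubuntu-24-04-3-minimal-s390x-2"
}

# Resolve the region-specific image ID from the portable name.
data "ibm_is_image" "os_image" {
  name = var.image_name
}

resource "ibm_is_instance" "compute" {
  # ...other arguments elided...
  image = data.ibm_is_image.os_image.id
}
```

Looking up by name means the same configuration works in any region, since IBM Cloud image IDs differ per region while catalog image names do not.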

Signed-off-by: Sudharshan Muralidharan <sudharshan.muralidharan1@ibm.com>
@Prajyot-Parab left a comment:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Prajyot-Parab, sudharshanibm

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit e770486 into kubernetes:main Nov 10, 2025
3 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.35 milestone Nov 10, 2025