
Conversation

@rhubert (Contributor) commented Sep 23, 2024

Bundle the results of all checkoutSteps needed to build a package and provide a bundle-config file so the package can be rebuilt from the bundle alone.

Such a bundle can be used to build on an air-gapped system or to archive all sources of a build. When building from the bundle, only the bundle is extracted and no other checkoutScripts are run.

There is one issue with the current URL-SCM extraction. The source workspace contains both the original downloaded file and the extracted sources. This unnecessarily doubles the size of the bundle and - since the bundle extraction uses the UrlScm as well - produces a different workspace hash when the bundle is extracted. That's why I changed the download to place the original file into workspace/../_download, where the .extracted file is also placed. This change currently makes the unit tests fail, as the ../_download folder is always the same when using a tempDir.

I'm not sure how to proceed here:

  • fix the unit-tests
  • use a workspace folder (e.g. workspace/.bob-download/) for the downloaded files and exclude this directory when hashing / bundling?
  • leave the download location as is and ignore the downloaded + .extracted file?
  • ...?

@jkloetzke (Member)

I would argue that the tarball download optimization is a welcome but unrelated improvement. I would move it to a separate PR that should probably be merged first.

Reusing the URL SCM for the bundles is IMHO not the right approach. It should instead work like the archive stuff. Right now binary artifacts are used for saving or restoring package steps. What we need here is to save and restore checkout steps. In the best case we can build on the archive module and reuse most code from there.

From a more general angle: should this work for indeterministic checkouts too? I would argue against this and only bundle deterministic checkouts. But if there is a good reason to decide otherwise I'm open to it. It's just that my gut feeling is that it will get nasty to get all corner cases correct...

codecov bot commented Sep 30, 2024

Codecov Report

❌ Patch coverage is 96.51163% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.11%. Comparing base (6d75dfb) to head (4cf0a21).

Files with missing lines      Patch %   Lines
pym/bob/archive.py            96.69%    4 Missing ⚠️
pym/bob/builder.py            96.55%    1 Missing ⚠️
pym/bob/cmds/build/build.py   95.45%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #586      +/-   ##
==========================================
+ Coverage   89.03%   89.11%   +0.08%     
==========================================
  Files          50       50              
  Lines       16137    16290     +153     
==========================================
+ Hits        14367    14517     +150     
- Misses       1770     1773       +3     


@rhubert (Contributor) commented Jan 4, 2025

I added a new implementation using the archive stuff. Since adding files to the final bundle using tarfile cannot be done in parallel, this finalization step is necessary. Maybe this could be optimized somehow, but it seems to be impossible to synchronize this asyncio stuff. It's either unable to pickle asyncio.future objects, or the lock is created in the wrong loop. 🤷
With this finalization method it's fairly simple and works - I don't think time matters that much when a bundle is packed.
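The sequential finalization described above can be sketched roughly as follows. This is a minimal illustration, not Bob's actual code: parallel checkout steps drop their results into a staging directory, and one final, single-threaded pass packs everything into the tar file, since tarfile does not support concurrent writers.

```python
import os
import tarfile

def finalize_bundle(staging_dir, bundle_path):
    """Pack everything collected in staging_dir into a single tar file.

    tarfile cannot be written to from multiple processes, so this runs
    once, sequentially, after all parallel checkout steps have finished.
    """
    with tarfile.open(bundle_path, "w") as tar:
        for name in sorted(os.listdir(staging_dir)):
            tar.add(os.path.join(staging_dir, name), arcname=name)
```

Running the packing once at the end sidesteps the asyncio synchronization problem entirely, at the cost of a serial step whose runtime is usually negligible compared to the build itself.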

To get the blackbox test working #606 is required. As of today I haven't tested bundling / unbundling a larger, real world project. I'll do this in the next days.

@rhubert rhubert force-pushed the rh-bundle branch 3 times, most recently from 9d269b4 to 059f566 on January 20, 2025
@rhubert rhubert changed the title WIP: bundle bundle Jan 22, 2025
@jkloetzke (Member)

I thought long about this and I have no final idea yet. So far, just some random thoughts:

  • We should probably use a zip-file. This has at least the benefit that we don't have to extract the whole bundle up-front when reading from it.
  • Your problems with asyncio come from the fact that all up- and downloads are executed in separate processes. They truly run in parallel and are forks of the original Bob Python process.
  • Do we really have to fail on indeterministic checkouts? What if we just ignore them?
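The zip-file point in the list above is worth spelling out: unlike a tar stream, a zip archive carries a central directory, so a single member can be read without scanning or extracting the rest of the bundle. A minimal illustration using only the standard library:

```python
import zipfile

def read_one_member(bundle_path, member):
    # The zip central directory lets ZipFile seek straight to the
    # requested member instead of streaming through the whole archive.
    with zipfile.ZipFile(bundle_path) as zf:
        with zf.open(member) as f:
            return f.read()
```

With a tar bundle, reading one source archive near the end would require decompressing everything before it.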

IMHO, bundling the sources is nothing special compared to up- or downloading dist workspaces. In fact, I could imagine it would make sense to have this feature for all other transports too. Maybe not as default but as an option. As such, what about defining new src-upload and src-download archive flags? If you set them explicitly for a backend, the sources are up- and downloaded.
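Sketched as an archive entry in a Bob user configuration, the proposal might look like this. Note that the src-upload/src-download flag names and their semantics are only being proposed in this thread, not an existing Bob feature:

```yaml
archive:
  - backend: file
    path: /mnt/mirror/source-bundle
    flags: [src-upload, src-download]
```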

In the end, the bundle option is just another way of defining a local source upload backend. It's just special in the way that it will create a big zip file in the end. That zip file could also hold "normal" binary artifacts. Not sure if this orthogonal design yields something useful but in principle it looks feasible.

I'm not yet sure what we should do about possible git or svn directories in source workspaces. As they are ignored when hashing the workspace we could skip them when packing the sources too...
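Skipping the VCS metadata while packing could look like the following sketch. The skip set mirrors what workspace hashing ignores as described above; the function name and signature are hypothetical, not Bob's actual API:

```python
import os
import zipfile

VCS_DIRS = {".git", ".svn"}

def pack_workspace(workspace, zf):
    """Add all workspace files to an open ZipFile, pruning VCS metadata.

    Pruning dirs in-place stops os.walk from descending into .git/.svn,
    matching the directories that workspace hashing already ignores.
    """
    for root, dirs, files in os.walk(workspace):
        dirs[:] = [d for d in dirs if d not in VCS_DIRS]
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, os.path.relpath(path, workspace))
```

Making the skip set configurable (as discussed below) would just mean passing it in instead of hard-coding it.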

@rhubert (Contributor) commented Jan 30, 2025

  • Do we really have to fail on indeterministic checkouts? What if we just ignore them?

🤷 I don't use indeterministic checkouts so I don't have a strong opinion here. I see valid arguments for all 3 options, so maybe we should add a bundle-indeterministic=[fail,ignore,add] option to move this decision to the user?

IMHO, bundling the sources is nothing special compared to up- or downloading dist workspaces. In fact, I could imagine it would make sense to have this feature for all other transports too. Maybe not as default but as an option. As such, what about defining new src-upload and src-download archive flags? If you set them explicitly for a backend, the sources are up- and downloaded.

In the end, the bundle option is just another way of defining a local source upload backend. It's just special in the way that it will create a big zip file in the end. That zip file could also hold "normal" binary artifacts. Not sure if this orthogonal design yields something useful but in principle it looks feasible.

Handled like this, we don't even need the --bundle option, right? The user only has to define an archive backend with src-upload and build with --always-checkout to get the backend populated with all the sources. The bundle-indeterministic option would then become a src-upload-indeterministic flag?

I'm not yet sure what we should do about possible git or svn directories in source workspaces. As they are ignored when hashing the workspace we could skip them when packing the sources too...

Same as for the indeterministic packages. I think there are use cases where it could be useful to have these directories on the air-gapped machine, e.g. when development takes place there. If it's bundled just for review we can skip them, so I'd add another option here to make this configurable.

@jkloetzke (Member)

Handled like this, we don't even need the --bundle option, right? The user only has to define an archive backend with src-upload and build with --always-checkout to get the backend populated with all the sources.

I guess the --bundle option might still make sense. It will shorten the steps (define temporary backend with right flags, enable always-checkout, enable source upload).

The bundle-indeterministic option would then become a src-upload-indeterministic flag?

Sounds reasonable. Also the source up-/download should probably be separate switches.

I'm not yet sure what we should do about possible git or svn directories in source workspaces. As they are ignored when hashing the workspace we could skip them when packing the sources too...

Same as for the indeterministic packages. I think there are use cases where it could be useful to have these directories on the air-gapped machine, e.g. when development takes place there. If it's bundled just for review we can skip them, so I'd add another option here to make this configurable.

Sounds reasonable. I think that we should throw it away by default because it will bloat the archive. But if somebody wants to retain the git-directories, why not...

@jkloetzke (Member)

I've created a 1.0 milestone for mid-April because a new release is overdue IMHO. Do you think this PR should be part of it? I'm not entirely sure because it feels like the design is not really settled yet. What's your opinion?

@rhubert (Contributor) commented Mar 24, 2025

I did not continue working on this to avoid conflicts with the remote archive clean PR. Mid of April (which year?) might be possible with some priority shifting on my side... but it wouldn't be a big issue if this lands after 1.0.

@jkloetzke (Member)

Mid of April (which year?)

Obviously this year. Otherwise I wouldn't have asked. 😆 😉

might be possible with some priority shifting on my side... but it wouldn't be a big issue if this lands after 1.0.

Then I would tentatively shift this feature to the next release if there is no need to hurry...

@rhubert rhubert force-pushed the rh-bundle branch 5 times, most recently from 496d465 to 276140d on May 18, 2025
@jkloetzke (Member) left a comment

All in all I'm not sure how the interaction between bundle and non-bundle builds should work. That is, a previous build used one method and the next the other approach. There are 4 possibilities:

  1. Bundle build -> bundle build: The bundle was extracted in the first build. The next build will skip the checkout step.
  2. Regular build -> regular build: as always
  3. Regular build -> bundle build: should this do nothing? Obviously the bundle should not be extracted again. But what if the bundle was updated for an indeterministic checkout? We cannot overwrite source code.
  4. Bundle build -> regular build: I think we must prevent the SCMs from running. Otherwise they will immediately collide with the existing files in the workspace.

My gut feeling is that these corner cases need more thought...

Do not add packages matching (fnmatch) RE to the bundle.

``--bundle-indeterministic {yes,no,fail}``
Control how indeterministic sources are added to bundles.
@jkloetzke (Member):

Could you add an explanation of what yes/no/fail mean?

zip-file containing all sources needed to build the package. This also enables
`--always-checkout`.

``--bundle-exclude BUNDLE_EXCLUDE``
@jkloetzke (Member):

Could we make this a regex? That would be more consistent with --always-checkout and --download=packages= and --download-layer=.
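The suggestion is safe in the sense that regexes subsume globs: fnmatch patterns translate mechanically into regular expressions, so a regex-based --bundle-exclude would be strictly more expressive. A small demonstration with the standard library:

```python
import fnmatch
import re

# fnmatch.translate turns a glob into an anchored regex...
glob_rx = re.compile(fnmatch.translate("lib*"))
assert glob_rx.match("libfoo")
assert not glob_rx.match("mylib")

# ...while a hand-written regex additionally allows alternation,
# which globs cannot express:
rx = re.compile(r"lib(foo|bar)$")
assert rx.match("libbar")
assert not rx.match("libbaz")
```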

[--install | --no-install]
[--sandbox | --slim-sandbox | --dev-sandbox | --strict-sandbox | --no-sandbox]
[--clean-checkout] [--attic | --no-attic]
[--bundle BUNDLE] [--bundle-exclude BUNDLE_EXCLUDE] [--unbundle]
@jkloetzke (Member):

Looks like this needs some updates. --bundle-indeterministic and --bundle-vcs are missing.

Using source bundles
====================

A source code bundle is a tar file with all the sources needed to build a
@jkloetzke (Member):

The whole section needs some updates too. The bundle is a zip file by now...

def canDownload(self):
    return True

def canDownloadSrc(self):
@jkloetzke (Member):

No need to touch the JenkinsArchive. It is some internal backend that is only used for Jenkins jobs. As such, it will never up-/download sources.

@rhubert (Contributor, Author):

But I need to add these methods since they are called from

await self._cookCheckoutStep(step, depth)
  File "/home/ralf/projects/bob/pym/bob/builder.py", line 1237, in _cookCheckoutStep
    if self.__archive.canDownloadSrc():

@jkloetzke (Member):

Right. 🤦 But it does not need to up-/download sources. So just return False here and at least you do not need to bother with uploadPackage...

if self.__bundle:
    self.__temp = TemporaryDirectory(dir=os.getcwd(),
                                     prefix=".bundle")
    os.makedirs(self.__temp.name, exist_ok=True)
@jkloetzke (Member):

I don't think it is necessary to create the directory. The TemporaryDirectory will have securely created it.

self.__srcUploadIndeterministic = spec.get("src-upload-indeterministic", "no")
spec["flags"] = ["src-upload" if self.__bundle else "src-download"]
if self.__bundle:
    self.__temp = TemporaryDirectory(dir=os.getcwd(),
@jkloetzke (Member):

Doesn't this leak the temporary directory if the build fails?
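One common way to avoid such a leak is to tie the temporary directory's lifetime to a context manager, so it is removed both on normal exit and when the build raises. This is only a sketch of the pattern, not the PR's actual code; run_bundled_build and do_build are hypothetical names:

```python
import os
from tempfile import TemporaryDirectory

def run_bundled_build(do_build):
    # TemporaryDirectory removes the directory on normal exit *and*
    # when do_build raises, so a failed build cannot leak it.
    with TemporaryDirectory(dir=os.getcwd(), prefix=".bundle") as temp:
        return do_build(temp)
```

Holding the TemporaryDirectory object as an attribute instead relies on garbage collection (or an explicit cleanup() call) for removal, which is exactly what can be skipped when the build aborts.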

if self.__archive.canDownloadSrc():
    audit = os.path.join(os.path.dirname(checkoutStep.getWorkspacePath()),
                         "audit.json.gz")
    wasDownloaded = await self.__archive.downloadPackage(checkoutStep,
@jkloetzke (Member):

This looks very dangerous. It could overwrite existing workspaces. Downloading something must only ever happen into a freshly created workspace.

assert predicted, "Non-predicted incorrect Build-Id found!"
self.__handleChangedBuildId(checkoutStep, checkoutHash)

if self.__archive.canUploadSrc(checkoutStep):
@jkloetzke (Member):

So we would upload every time the checkout step is visited. This includes when the checkout was skipped! I think the upload should only be done after a fresh checkout, like the live build-Id. That may be different for bundles but otherwise it will create a lot of traffic with the artifact server.

@rhubert (Contributor) commented May 19, 2025

All in all I'm not sure how the interaction between bundle and non-bundle builds should work. That is, a previous build used one method and the next the other approach. There are 4 possibilities:

1. Bundle build -> bundle build: The bundle was extracted in the first build. The next build will skip the checkout step.

2. Regular build -> regular build: as always

3. Regular build -> bundle build: should this do nothing? Obviously the bundle should not be extracted again. But what if the bundle was updated for an indeterministic checkout? We cannot overwrite source code.

I'd say deterministic packages will stay as they are as long as they are not changed. For indeterministic packages, or if the checkout was updated, there might be some attic logic needed...

4. Bundle build -> regular build: I think we must prevent the SCMs from running. Otherwise they will immediately collide with the existing files in the workspace.

IMO we also need to move the workspace to the attic and run the scm-checkout.

My gut feeling is that these corner cases need more thought...

I don't have a use-case for anything but 1 and 2. Maybe we can just forbid 3 and 4 somehow until there is a need for it?

BTW - I'm doing some testing with a real project ATM. Besides two packages triggering the "Hash doesn't match after unbundle" error (I'm looking into it), it was necessary to disable the ci.bobbuildtool.dev archive in basement on my air-gapped machine. Otherwise I got
BuildError: http://ci.bobbuildtool.dev/artifacts/: Cannot download file: [Errno -3] Temporary failure in name resolution
raised on the first QUERY step. Do we really need to raise a build error on OSErrors in downloadPackage?

Also it was necessary to change the layers from managed, ...

@jkloetzke (Member)

The failure on name resolution errors should be fixed with the current master. Does it still happen with the latest version? By now they should be treated as if the artifact was not found.

@rhubert (Contributor) commented May 20, 2025

Yes, this happens on master + bundle - basically the current state of this PR.

-> https://github.com/BobBuildTool/bob/blob/master/pym/bob/archive.py#L459

I did a quick test on my system and it seems the handling of socket.gaierror is missing there...
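The missing handling amounts to treating name-resolution failures like any other unreachable-artifact condition instead of letting them escape as a BuildError. A minimal sketch of that idea (hypothetical helper, not the actual fix that landed in 5c7e356):

```python
import socket
import urllib.error
import urllib.request

def try_download(url):
    """Return the payload, or None if the artifact is unreachable.

    socket.gaierror (name resolution failure) is an OSError subclass,
    so catching OSError covers both it and plain connection errors;
    the caller then treats the artifact as simply not found.
    """
    try:
        with urllib.request.urlopen(url, timeout=10) as rsp:
            return rsp.read()
    except (urllib.error.URLError, OSError):
        return None
```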

@jkloetzke (Member)

Yes, this happens on master + bundle - basically the current state of this PR.

-> https://github.com/BobBuildTool/bob/blob/master/pym/bob/archive.py#L459

Grmpf. You looked (like me) at the wrong spot because the error is handled 5 lines above. But there is another path for the live build-id. Hopefully fixed with 5c7e356.

@jkloetzke (Member)

I'd say deterministic packages will stay as they are as long as they are not changed. For indeterministic packages, or if the checkout was updated, there might be some attic logic needed...

Sounds good. So whenever we change between bundle-/non-bundle mode the sources will be moved to the attic.

I'm unsure what to do about the indeterministic bundle packages. Is this a use case on your side? If not, I would forbid them.

@rhubert rhubert marked this pull request as draft June 9, 2025 09:24
@rhubert (Contributor) commented Jun 10, 2025

All in all I'm not sure how the interaction between bundles- and non-bundles-builds should work.

I think we have the same question/issue for regular src-upload / src-download vs. scm-checkouts? If the sources are downloaded in the first place but couldn't be downloaded after a recipe change, we cannot simply run the regular checkout in this workspace. (Also extracting the new sources in the same workspace could lead to a different result.)

So every time a checkout should run, we need to move modified existing sources to the attic and run the checkout in a fresh workspace. Unmodified sources can be deleted? Or am I missing something?
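The policy proposed here can be sketched as plain Python; all names and the was_modified flag are hypothetical, not Bob's actual API:

```python
import os
import shutil

def prepare_fresh_checkout(workspace, was_modified, attic_dir):
    """Before a (re-)checkout: salvage modified sources, drop the rest.

    Modified workspaces are moved to the attic so no local work is
    lost; unmodified ones can simply be deleted. Either way the
    checkout then runs in a fresh, empty workspace.
    """
    if os.path.exists(workspace):
        if was_modified:
            os.makedirs(attic_dir, exist_ok=True)
            shutil.move(workspace,
                        os.path.join(attic_dir, os.path.basename(workspace)))
        else:
            shutil.rmtree(workspace)
    os.makedirs(workspace)
```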

@jkloetzke (Member)

So every time a checkout should run, we need to move modified existing sources to the attic and run the checkout in a fresh workspace.

Right. Sounds like the safest and easiest option. We should prevent re-runs for downloaded indeterministic checkouts though. Unless the bundle is changed, it will yield the same result and just create churn.

BTW, are indeterministic bundled checkouts needed?

Unmodified sources can be deleted? Or am I missing something?

In principle yes. Currently we don't make a checksum before the update. That would be a new case for downloaded source workspaces. I wouldn't mind if we just consistently move the workspace to the attic instead if that turns out to be hard to implement.

@rhubert (Contributor) commented Jun 12, 2025

BTW, are indeterministic bundled checkouts needed?

I don't have a use case for them.

rhubert and others added 11 commits September 25, 2025 17:56
This method is called by the builder once the build has been finished
successfully.

The bundle archive is for up-/download of sources only. It uses a
temporary directory in the workspace to collect all sources needed to
build the package. The temporary directory is in the workspace since the
typical temp-dir location might not have enough free space.
The finish method puts all the sources in the temp-dir into an uncompressed
zip file.

When downloading sources from this archive, the sources are extracted
from the zip file again.

And use the builder functions to enable and finish bundling.

Since 27d791f the error codes of subprocesses are returned if such a
command fails. Make expect_fail happy with anything but zero.

Move the affected sources to the attic whenever the bundle state changes.
@rhubert (Contributor) commented Sep 25, 2025

I found some time to continue on this, but now I'm stuck. For the tests I'd like to verify that we only bundle sources as they are referenced by the recipes, without any local modifications. But for url-scms the clean-checkout has no effect? (https://github.com/BobBuildTool/bob/pull/586/files#diff-35d1d446d0498b0359b9232bd9c7ec4cbe21a206834fd893aa26b28fdeab20fdR125)

@jkloetzke (Member)

But for url-scms the clean-checkout has no effect?

No, it doesn't. Bob relies on git and svn to tell us when something was changed. For the URL SCM, no state is tracked internally. :/
