From 3e1e426dff603dffe6609ef502b47cc04982942f Mon Sep 17 00:00:00 2001
From: Christian Mesh
Date: Fri, 23 Aug 2024 09:10:55 -0400
Subject: [PATCH 1/4] Add RFC for global provider cache locking

Signed-off-by: Christian Mesh
---
 rfc/20240824-provider-cache-locking.md | 68 ++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 rfc/20240824-provider-cache-locking.md

diff --git a/rfc/20240824-provider-cache-locking.md b/rfc/20240824-provider-cache-locking.md
new file mode 100644
index 0000000000..56169505ba
--- /dev/null
+++ b/rfc/20240824-provider-cache-locking.md
@@ -0,0 +1,68 @@
+# Safe Global Provider Cache Access
+
+Issue: https://github.com/OpenTofu/opentofu/issues/1483
+
+Many CI/CD systems download all providers for every single tofu action run. The global provider cache allows downloaded providers to be saved in between runs and linked into the local provider cache (in .terraform/). Unfortunately, the global provider cache does not have any locking around it and can fail when used in multiple simultaneous runs.
+
+Another common use case is terragrunt. When running `tofu init` on multiple projects simultaneously via the tool, there is a high likely hood of a conflicting write to a provider's directory. This has caused some interesting tooling workarounds on their end, which are not ideal.
+
+Additionally, as we build out true e2e tests for OpenTofu, safe access to the global provider cache can dramatically reduce run times.
+
+## Proposed Solution
+
+A filesystem level lock should be added that is safe for cross-process and in some scenarios cross-machine access. It will be best-effort and should rely on standard locking practices (not home grown) and be transparent to the user.
+
+### User Documentation
+
+By setting `TF_PLUGIN_CACHE_DIR` or `plugin_cache_dir` to a directory, tofu will download all required providers to that directory during init. If providers already exist within that directory, the lock file will be checked and the download may be skipped. Once the global cache is populated, providers will be linked into the local provider cache. This can save significant time/resources across multiple runs/projects.
+
+Example:
+```
+$ TF_PLUGIN_CACHE_DIR=~/.tofu.d/plugin-cache/ tofu init
+Initializing provider plugins...
+- Finding latest version of hashicorp/local...
+- Installing hashicorp/local v2.5.1...
+
+$ rm .terraform/ -r
+$ TF_PLUGIN_CACHE_DIR=~/.tofu.d/plugin-cache/ tofu init
+Initializing provider plugins...
+- Reusing previous version of hashicorp/local from the dependency lock file
+- Using hashicorp/local v2.5.1 from the shared cache directory
+```
+
+Any number of tofu instances should be able to be run in parallel and still have safe access to the global cache directory.
+
+#### Scenarios which currently cause failures
+
+Multiple init/plans/apply without a primed cache.
+
+One possibility is that each project downloads and overwrites the same provider files. The global provider cache scan is not a fast process and should be run as little as possible, it is therefore only run at the start of `tofu init`. ProjectA and ProjectB's inits may be run at the same time. When this occurs, they will both download the full set of providers that they need and clobber each others files. This is not ideal, but is only a waste of time/resources.
+
+Another possibility is that one project may already be planning/applying and have a live provider executable running, effectively locking it on disk. If ProjectA is still downloading providers during init and ProjectB has a much smaller list and is already in plan, ProjectA may attempt to overwrite the provider binary that ProjectB is running and maintaining an execution lock on.
+
+
+Mixture of missing platform/corrupt lock files with valid lock files.
+
+The provider lock file may have been generated on a system with a different architecture or may have become corrupt due to a variety of reasons. When this happens, the global cache may not match the lock file and force a re-download of the provider. This will cause potentially unexpected downloads in the above scenarios.
+
+### Technical Approach
+
+This proposal hinges on a stable and consistent cross-platform lock. The codebase already contains this in the form of locking local state files. This code is battle tested and overall quite simple to use. With some light refactoring, this filesystem locking code can be moved into it's own internal package and used both to lock providers and to lock the local state file.
+
+The existing file locking should be safe on any local filesystems, but should be used with caution on shared volumes such as legacy NFS shares who do not provide strong locking consistency.
+
+Within the provider installation code, the whole section which inspects and links to the current providers available should be locked at the provider level. The lock should be done at this granularity specifically for the complicating factors mentioned above.
+
+Additionally, we should make the package installer smarter and able to check the files in the cache against the downloaded version. This prevents a bad cache entry from overwriting a valid entry which may already been in use by another process.
+
+### Open Questions
+
+* Is fnctl flock safe enough? It is industry standard and battle tested. Are there any additional caveats that should be mentioned in the docs?
+
+### Future Considerations
+
+A considerable amount of time is spent scanning the provider cache. Instead, it should probably be refactored to only read the provider metadata necessary at any given time. This will offer some significant performance bonuses on large configurations.
+
+## Potential Alternatives
+
+Terragrunt/Gruntworks provides a http provider mirror that can be run locally and maintains a provider cache (archives) external to OpenTofu itself. While this does have some advantages, it still requires each provider to be extracted into the local provider cache folder, taking additional time and space. It may be a safer alternative to using a networked filesystem between systems.

From ebf7dae4c4ca28664ca68144b66837423e84f856 Mon Sep 17 00:00:00 2001
From: Christian Mesh
Date: Thu, 3 Oct 2024 15:17:29 -0400
Subject: [PATCH 2/4] Respond to PR Comments

Signed-off-by: Christian Mesh
---
 rfc/20240824-provider-cache-locking.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/rfc/20240824-provider-cache-locking.md b/rfc/20240824-provider-cache-locking.md
index 56169505ba..5d84c6ee91 100644
--- a/rfc/20240824-provider-cache-locking.md
+++ b/rfc/20240824-provider-cache-locking.md
@@ -10,7 +10,7 @@ Additionally, as we build out true e2e tests for OpenTofu, safe access to the gl
 
 ## Proposed Solution
 
-A filesystem level lock should be added that is safe for cross-process and in some scenarios cross-machine access. It will be best-effort and should rely on standard locking practices (not home grown) and be transparent to the user.
+A filesystem level lock via native syscall (fcntl flock in POSIX / LockFileEx in Windows) should be added that is safe for cross-process and in some scenarios cross-machine access. It will be best-effort and should rely on standard locking practices (not home grown) and be transparent to the user.
 
 ### User Documentation
 
@@ -49,7 +49,7 @@ The provider lock file may have been generated on a system with a different arch
 
 This proposal hinges on a stable and consistent cross-platform lock. The codebase already contains this in the form of locking local state files. This code is battle tested and overall quite simple to use. With some light refactoring, this filesystem locking code can be moved into it's own internal package and used both to lock providers and to lock the local state file.
 
-The existing file locking should be safe on any local filesystems, but should be used with caution on shared volumes such as legacy NFS shares who do not provide strong locking consistency.
+The existing file locking should be safe on any local filesystems, but should be used with caution on shared volumes such as legacy NFS shares who do not provide strong locking consistency. We will add explicit warnings to the documentation which recommend against using networked filesystems for this use case.
 
 Within the provider installation code, the whole section which inspects and links to the current providers available should be locked at the provider level. The lock should be done at this granularity specifically for the complicating factors mentioned above.
 
@@ -57,7 +57,8 @@ Additionally, we should make the package installer smarter and able to check the
 
 ### Open Questions
 
-* Is fnctl flock safe enough? It is industry standard and battle tested. Are there any additional caveats that should be mentioned in the docs?
+* Is a filesystem lock syscall safe enough? It is industry standard and battle tested. Are there any additional caveats that should be mentioned in the docs?
+  - Yes. As mentioned above, we should document that networked filesystems are not recommended for this use case.
 
 ### Future Considerations
 

From 48947e5df6fb948e8002588ddf766c250271297f Mon Sep 17 00:00:00 2001
From: Christian Mesh
Date: Thu, 3 Oct 2024 15:18:04 -0400
Subject: [PATCH 3/4] Update rfc/20240824-provider-cache-locking.md

Co-authored-by: James Humphries
Signed-off-by: Christian Mesh
---
 rfc/20240824-provider-cache-locking.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfc/20240824-provider-cache-locking.md b/rfc/20240824-provider-cache-locking.md
index 5d84c6ee91..0257bd4d57 100644
--- a/rfc/20240824-provider-cache-locking.md
+++ b/rfc/20240824-provider-cache-locking.md
@@ -6,7 +6,7 @@ Many CI/CD systems download all providers for every single tofu action run. The
 
 Another common use case is terragrunt. When running `tofu init` on multiple projects simultaneously via the tool, there is a high likely hood of a conflicting write to a provider's directory. This has caused some interesting tooling workarounds on their end, which are not ideal.
 
-Additionally, as we build out true e2e tests for OpenTofu, safe access to the global provider cache can dramatically reduce run times.
+Additionally, as we build out true e2e tests for OpenTofu, safe access to the global provider cache can dramatically reduce run times by allowing us to run these concurrently.
 
 ## Proposed Solution
 

From dc25384732bbe2906aa63c8f77b0e6878fd1d56d Mon Sep 17 00:00:00 2001
From: Christian Mesh
Date: Thu, 3 Oct 2024 15:23:41 -0400
Subject: [PATCH 4/4] Fix spelling (sorry Gruntwork!)

Signed-off-by: Christian Mesh
---
 rfc/20240824-provider-cache-locking.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfc/20240824-provider-cache-locking.md b/rfc/20240824-provider-cache-locking.md
index 0257bd4d57..230b40537e 100644
--- a/rfc/20240824-provider-cache-locking.md
+++ b/rfc/20240824-provider-cache-locking.md
@@ -66,4 +66,4 @@ A considerable amount of time is spent scanning the provider cache. Instead, it
 
 ## Potential Alternatives
 
-Terragrunt/Gruntworks provides a http provider mirror that can be run locally and maintains a provider cache (archives) external to OpenTofu itself. While this does have some advantages, it still requires each provider to be extracted into the local provider cache folder, taking additional time and space. It may be a safer alternative to using a networked filesystem between systems.
+Terragrunt/Gruntwork provides a http provider mirror that can be run locally and maintains a provider cache (archives) external to OpenTofu itself. While this does have some advantages, it still requires each provider to be extracted into the local provider cache folder, taking additional time and space. It may be a safer alternative to using a networked filesystem between systems.
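
Reviewer note: the per-provider locking described under "Technical Approach" in the RFC could look roughly like the Python sketch below. It is only an illustration under stated assumptions, not OpenTofu's implementation (which is written in Go). It assumes a POSIX system and uses `flock(2)` through Python's `fcntl.flock`, which may differ from the lock primitive the existing state-file locking uses. The names `provider_lock` and `install_provider`, the `.lock` file, and the `provider-binary` file name are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def provider_lock(provider_dir, blocking=True):
    """Hold an exclusive advisory lock scoped to one provider directory.

    The ".lock" file name is a hypothetical choice for this sketch.
    """
    os.makedirs(provider_dir, exist_ok=True)
    lock_path = os.path.join(provider_dir, ".lock")
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # With LOCK_NB, flock raises BlockingIOError if the lock is already held.
        fcntl.flock(fd, flags)
        try:
            yield lock_path
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def install_provider(cache_dir, provider, version, download):
    """Populate the cache entry unless another run already did so.

    `download` is a caller-supplied function returning the provider bytes.
    Returns True if this call wrote the entry, False if it was reused.
    """
    target = os.path.join(cache_dir, provider, version)
    with provider_lock(target):
        binary = os.path.join(target, "provider-binary")  # hypothetical name
        # Re-check under the lock: a concurrent run may have finished first.
        if os.path.exists(binary):
            return False
        tmp = binary + ".tmp"
        with open(tmp, "wb") as f:
            f.write(download())
        # Atomic rename so readers never observe a partially written file.
        os.replace(tmp, binary)
        return True
```

The existence check happens only after the lock is acquired, so two simultaneous `tofu init` style runs would not both download and clobber the same entry. Because the lock is advisory, it only helps when every writer takes it, and, as the RFC notes, it should not be relied on over networked filesystems.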