Several times over the years we've considered adding tracing
instrumentation to Terraform, since even when running in isolation as a
CLI program it has a "distributed system-like" structure, with lots of
concurrent internal work and also some work delegated to provider plugins
that are essentially temporarily-running microservices.
However, it's always felt a bit overwhelming to do it because much of
Terraform predates the Go context.Context idiom and so it's tough to get
a clean chain of context.Context values all the way down the stack without
disturbing a lot of existing APIs.
This commit aims to just get that process started by establishing how a
context can propagate from "package main" into the command package,
focusing initially on "terraform init" and some other commands that share
some underlying functions with that command.
OpenTelemetry has emerged as a de-facto industry standard and so this uses
its API directly, without any attempt to hide it behind an abstraction.
The OpenTelemetry API is itself already an adapter layer, so we should be
able to swap in any backend that uses comparable concepts. For now we just
discard the tracing reports by default, and allow users to opt in to
delivering traces over OTLP by setting an environment variable when
running Terraform (the environment variable was established in an earlier
commit, so this commit builds on that.)
When tracing collection is enabled, every Terraform CLI run will generate
at least one overall span representing the command that was run. Some
commands might also create child spans, but most currently do not.
Currently Terraform will use an entry from the global plugin cache only if
it matches a checksum already recorded in the dependency lock file. This
allows Terraform to produce a complete lock file entry on the first
encounter with a new provider, whereas using the cache in that case would
cause the lock file to only cover the single package in the cache and
thereefore be unusable on any other operating system or CPU architecture.
This temporary CLI config option is a pragmatic exception to support those
who cannot currently correctly use the dependency lock file but who still
want to benefit from the plugin cache. With this setting enabled,
Terraform has permission to produce a dependency lock file that is only
suitable for the current system if that would allow use of an existing
entry in the plugin cache.
We are introducing this option to resolve a conflict between the needs of
folks who are using the dependency lock file as expected and the needs of
folks who cannot use the dependency lock file for some reason. The hope
then is to give respite to those who need this exception in the meantime
while we understand better why they cannot use the dependency lock file
and improve its design so that everyone will be able to use it
successfully in a future version of Terraform. This option will become a
silent no-op in a future version of Terraform, once the dependency lock
file behavior is sufficient for all supported Terraform development
workflows.
We originally introduced the idea of language experiments as a way to get
early feedback on not-yet-proven feature ideas, ideally as part of the
initial exploration of the solution space rather than only after a
solution has become relatively clear.
Unfortunately, our tradeoff of making them available in normal releases
behind an explicit opt-in in order to make it easier to participate in the
feedback process had the unintended side-effect of making it feel okay
to use experiments in production and endure the warnings they generate.
This in turn has made us reluctant to make use of the experiments feature
lest experiments become de-facto production features which we then feel
compelled to preserve even though we aren't yet ready to graduate them
to stable features.
In an attempt to tweak that compromise, here we make the availability of
experiments _at all_ a build-time flag which will not be set by default,
and therefore experiments will not be available in most release builds.
The intent (not yet implemented in this PR) is for our release process to
set this flag only when it knows it's building an alpha release or a
development snapshot not destined for release at all, which will therefore
allow us to still use the alpha releases as a vehicle for giving feedback
participants access to a feature (without needing to install a Go
toolchain) but will not encourage pretending that these features are
production-ready before they graduate from experimental.
Only language experiments have an explicit framework for dealing with them
which outlives any particular experiment, so most of the changes here are
to that generalized mechanism. However, the intent is that non-language
experiments, such as experimental CLI commands, would also in future
check Meta.AllowExperimentalFeatures and gate the use of those experiments
too, so that we can be consistent that experimental features will never
be available unless you explicitly choose to use an alpha release or
a custom build from source code.
Since there are already some experiments active at the time of this commit
which were not previously subject to this restriction, we'll pragmatically
leave those as exceptions that will remain generally available for now,
and so this new approach will apply only to new experiments started in the
future. Once those experiments have all concluded, we will be left with
no more exceptions unless we explicitly choose to make an exception for
some reason we've not imagined yet.
It's important that we be able to write tests that rely on experiments
either being available or not being available, so here we're using our
typical approach of making "package main" deal with the global setting
that applies to Terraform CLI executables while making the layers below
all support fine-grain selection of this behavior so that tests with
different needs can run concurrently without trampling on one another.
As a compromise, the integration tests in the terraform package will
run with experiments enabled _by default_ since we commonly need to
exercise experiments in those tests, but they can selectively opt-out
if they need to by overriding the loader setting back to false again.
The init command needs to initialize a backend, in order to access
state, in turn to derive provider requirements from state. The backend
initialization step requires building provider factories, which
previously would fail if a lockfile was present without a corresponding
local provider cache.
This commit ensures that in this situation only, errors with the
provider factories are temporarily ignored. This allows us to continue
to initialize the backend, fetch providers, and then report any errors
as necessary.
In the original incarnation of Meta.providerFactories we were returning
into a Meta.contextOpts whose signature didn't allow it to return an
error directly, and so we had compromised by making the provider factory
functions themselves return errors once called.
Subsequent work made Meta.contextOpts need to return an error anyway, but
at the time we neglected to update our handling of the providerFactories
result, having it still defer the error handling until we finally
instantiate a provider.
Although that did ultimately get the expected result anyway, the error
ended up being reported from deep in the guts of a Terraform Core graph
walk, in whichever concurrently-visited graph node happened to try to
instantiate the plugin first. This meant that the exact phrasing of the
error message would vary between runs and the reporting codepath didn't
have enough context to given an actionable suggestion on how to proceed.
In this commit we make Meta.contextOpts pass through directly any error
that Meta.providerFactories produces, and then make Meta.providerFactories
produce a special error type so that Meta.Backend can ultimately return
a user-friendly diagnostic message containing a specific suggestion to
run "terraform init", along with a short explanation of what a provider
plugin is.
The reliance here on an implied contract between two functions that are
not directly connected in the callstack is non-ideal, and so hopefully
we'll revisit this further in future work on the overall architecture of
the CLI layer. To try to make this robust in the meantime though, I wrote
it to use the errors.As function to potentially unwrap a wrapped version
of our special error type, in case one of the intervening layers is
changed at some point to wrap the downstream error before returning it.
Historically the responsibility for making sure that all of the available
providers are of suitable versions and match the appropriate checksums has
been split rather inexplicably over multiple different layers, with some
of the checks happening as late as creating a terraform.Context.
We're gradually iterating towards making that all be handled in one place,
but in this step we're just cleaning up some old remnants from the
main "terraform" package, which is now no longer responsible for any
version or checksum verification and instead just assumes it's been
provided with suitable factory functions by its caller.
We do still have a pre-check here to make sure that we at least have a
factory function for each plugin the configuration seems to depend on,
because if we don't do that up front then it ends up getting caught
instead deep inside the Terraform runtime, often inside a concurrent
graph walk and thus it's not deterministic which codepath will happen to
catch it on a particular run.
As of this commit, this actually does leave some holes in our checks: the
command package is using the dependency lock file to make sure we have
exactly the provider packages we expect (exact versions and checksums),
which is the most crucial part, but we don't yet have any spot where
we make sure that the lock file is consistent with the current
configuration, and we are no longer preserving the provider checksums as
part of a saved plan.
Both of those will come in subsequent commits. While it's unusual to have
a series of commits that briefly subtracts functionality and then adds
back in equivalent functionality later, the lock file checking is the only
part that's crucial for security reasons, with everything else mainly just
being to give better feedback when folks seem to be using Terraform
incorrectly. The other bits are therefore mostly cosmetic and okay to be
absent briefly as we work towards a better design that is clearer about
where that responsibility belongs.
Thus far our various interactions with the bits of state we keep
associated with a working directory have all been implemented directly
inside the "command" package -- often in the huge command.Meta type -- and
not managed collectively via a single component.
There's too many little codepaths reading and writing from the working
directory and data directory to refactor it all in one step, but this is
an attempt at a first step towards a future where everything that reads
and writes from the current working directory would do so via an object
that encapsulates the implementation details and offers a high-level API
to read and write all of these session-persistent settings.
The design here continues our gradual path towards using a dependency
injection style where "package main" is solely responsible for directly
interacting with the OS command line, the OS environment, the OS working
directory, the stdio streams, and the CLI configuration, and then
communicating the resulting information to the rest of Terraform by wiring
together objects. It seems likely that eventually we'll have enough wiring
code in package main to justify a more explicit organization of that code,
but for this commit the new "workdir.Dir" object is just wired directly in
place of its predecessors, without any significant change of code
organization at that top layer.
This first commit focuses on the main files and directories we use to
find provider plugins, because a subsequent commit will lightly reorganize
the separation of concerns for plugin launching with a similar goal of
collecting all of the relevant logic together into one spot.
Previously terraform.Context was built in an unfortunate way where all of
the data was provided up front in terraform.NewContext and then mutated
directly by subsequent operations. That made the data flow hard to follow,
commonly leading to bugs, and also meant that we were forced to take
various actions too early in terraform.NewContext, rather than waiting
until a more appropriate time during an operation.
This (enormous) commit changes terraform.Context so that its fields are
broadly just unchanging data about the execution context (current
workspace name, available plugins, etc) whereas the main data Terraform
works with arrives via individual method arguments and is returned in
return values.
Specifically, this means that terraform.Context no longer "has-a" config,
state, and "planned changes", instead holding on to those only temporarily
during an operation. The caller is responsible for propagating the outcome
of one step into the next step so that the data flow between operations is
actually visible.
However, since that's a change to the main entry points in the "terraform"
package, this commit also touches every file in the codebase which
interacted with those APIs. Most of the noise here is in updating tests
to take the same actions using the new API style, but this also affects
the main-code callers in the backends and in the command package.
My goal here was to refactor without changing observable behavior, but in
practice there are a couple externally-visible behavior variations here
that seemed okay in service of the broader goal:
- The "terraform graph" command is no longer hooked directly into the
core graph builders, because that's no longer part of the public API.
However, I did include a couple new Context functions whose contract
is to produce a UI-oriented graph, and _for now_ those continue to
return the physical graph we use for those operations. There's no
exported API for generating the "validate" and "eval" graphs, because
neither is particularly interesting in its own right, and so
"terraform graph" no longer supports those graph types.
- terraform.NewContext no longer has the responsibility for collecting
all of the provider schemas up front. Instead, we wait until we need
them. However, that means that some of our error messages now have a
slightly different shape due to unwinding through a differently-shaped
call stack. As of this commit we also end up reloading the schemas
multiple times in some cases, which is functionally acceptable but
likely represents a performance regression. I intend to rework this to
use caching, but I'm saving that for a later commit because this one is
big enough already.
The proximal reason for this change is to resolve the chicken/egg problem
whereby there was previously no single point where we could apply "moved"
statements to the previous run state before creating a plan. With this
change in place, we can now do that as part of Context.Plan, prior to
forking the input state into the three separate state artifacts we use
during planning.
However, this is at least the third project in a row where the previous
API design led to piling more functionality into terraform.NewContext and
then working around the incorrect order of operations that produces, so
I intend that by paying the cost/risk of this large diff now we can in
turn reduce the cost/risk of future projects that relate to our main
workflow actions.
This is part of a general effort to move all of Terraform's non-library
package surface under internal in order to reinforce that these are for
internal use within Terraform only.
If you were previously importing packages under this prefix into an
external codebase, you could pin to an earlier release tag as an interim
solution until you've make a plan to achieve the same functionality some
other way.
This is part of a general effort to move all of Terraform's non-library
package surface under internal in order to reinforce that these are for
internal use within Terraform only.
If you were previously importing packages under this prefix into an
external codebase, you could pin to an earlier release tag as an interim
solution until you've make a plan to achieve the same functionality some
other way.
This is part of a general effort to move all of Terraform's non-library
package surface under internal in order to reinforce that these are for
internal use within Terraform only.
If you were previously importing packages under this prefix into an
external codebase, you could pin to an earlier release tag as an interim
solution until you've make a plan to achieve the same functionality some
other way.