Historically the responsibility for making sure that all of the available
providers are of suitable versions and match the appropriate checksums has
been split rather inexplicably over multiple different layers, with some
of the checks happening as late as creating a terraform.Context.
We're gradually iterating towards making that all be handled in one place,
but in this step we're just cleaning up some old remnants from the
main "terraform" package, which is now no longer responsible for any
version or checksum verification and instead just assumes it's been
provided with suitable factory functions by its caller.
We do still have a pre-check here to make sure that we at least have a
factory function for each plugin the configuration seems to depend on,
because if we don't do that up front then it ends up getting caught
instead deep inside the Terraform runtime, often inside a concurrent
graph walk and thus it's not deterministic which codepath will happen to
catch it on a particular run.
As of this commit, this actually does leave some holes in our checks: the
command package is using the dependency lock file to make sure we have
exactly the provider packages we expect (exact versions and checksums),
which is the most crucial part, but we don't yet have any spot where
we make sure that the lock file is consistent with the current
configuration, and we are no longer preserving the provider checksums as
part of a saved plan.
Both of those will come in subsequent commits. While it's unusual to have
a series of commits that briefly subtracts functionality and then adds
back in equivalent functionality later, the lock file checking is the only
part that's crucial for security reasons, with everything else mainly just
being to give better feedback when folks seem to be using Terraform
incorrectly. The other bits are therefore mostly cosmetic and okay to be
absent briefly as we work towards a better design that is clearer about
where that responsibility belongs.
Our current implementation of destroy planning includes secretly running a
normal plan first, in order to get its effect of refreshing the state.
Previously our warning about colliding moves would betray that
implementation detail because we'd return it from both of our planning
operations here and thus show the message twice. That would also have
happened in theory for any other warnings emitted by both plan operations,
but it's the move collision warning that made it immediately visible.
We'll now only return warnings from the initial plan if we're also
returning errors from that plan, and thus the warnings of both plans can
never mix together into the same diags and thus we'll avoid duplicating
any warnings.
This does mean that we'd lose any warnings which might hypothetically
emerge only from the hidden normal plan and not from the subsequent
destroy plan, but we'll accept that as an okay tradeoff here because those
warnings are likely to not be super relevant to the destroy case anyway,
or else we'd emit them from the destroy-plan walk too.
Because our validation rules depend on some dynamic results produced by
actually running the plan, we deal with moves in a "backwards" order where
we try to apply them first -- ignoring anything strange we might find --
and then validate the original statements only after planning.
An unfortunate consequence of that approach is that when the move
statements are invalid it's likely that move execution will not fully
complete, and so the generated plan is likely to be incorrect and might
well include errors resulting from the unresolved moves.
To mitigate that, here we let any move validation errors supersede all
other diagnostics that the plan phase might've generated, in the hope that
it'll help the user focus on fixing the incorrect move statements without
creating confusing by reporting errors that only appeared as a quick of
how Terraform worked around the invalid move statements earlier.
In most cases Terraform will be able to automatically fully resolve all
of the pending move statements before creating a plan, but there are some
edge cases where we can end up wanting to move one object to a location
where another object is already declared.
One relatively-obvious example is if someone uses "terraform state mv" in
order to create a set of resource instance bindings that could never have
arising in normal Terraform use.
A less obvious example arises from the interactions between moves at
different levels of granularity. If we are both moving a module to a new
address and moving a resource into an instance of the new module at the
same time, the old module might well have already had a resource of the
same name and so the resource move will be unresolvable.
In these situations Terraform will move the objects as far as possible,
but because it's never valid for a move "from" address to still be
declared in the configuration Terraform will inevitably always plan to
destroy the objects that didn't find a final home. To give some additional
explanation for that result, here we'll add a warning which describes
what happened.
This is not a particularly actionable warning because we don't really
have enough information to guess what the user intended, but we do at
least prompt that they might be able to use the "terraform state" family
of subcommands to repair the ambiguous situation before planning, if they
want a different result than what Terraform proposed.
When we originally stubbed ApplyMoves we didn't know yet how exactly we'd
be using the result, so we made it a double-indexed map allowing looking
up moves in both directions.
However, in practice we only actually need to look up old addresses by new
addresses, and so this commit first removes the double indexing so that
each move is only represented by one element in the map.
We also need to describe situations where a move was blocked, because in
a future commit we'll generate some warnings in those cases. Therefore
ApplyMoves now returns a MoveResults object which contains both a map of
changes and a map of blocks. The map of blocks isn't used yet as of this
commit, but we'll use it in a later commit to produce warnings within
the "terraform" package.
Going back a long time we've had a special magic behavior which tries to
recognize a situation where a module author either added or removed the
"count" argument from a resource that already has instances, and to
silently rename the zeroth or no-key instance so that we don't plan to
destroy and recreate the associated object.
Now we have a more general idea of "move statements", and specifically
the idea of "implied" move statements which replicates the same heuristic
we used to use for this behavior, we can treat this magic renaming rule as
just another "move statement", special only in that Terraform generates it
automatically rather than it being written out explicitly in the
configuration.
In return for wiring that in, we can now remove altogether the
NodeCountBoundary graph node type and its associated graph transformer,
CountBoundaryTransformer. We handle moves as a preprocessing step before
building the plan graph, so we no longer need to include any special nodes
in the graph to deal with that situation.
The test updates here are mainly for the graph builders themselves, to
acknowledge that indeed we're no longer inserting the NodeCountBoundary
vertices. The vertices that NodeCountBoundary previously depended on now
become dependencies of the special "root" vertex, although in many cases
here we don't see that explicitly because of the transitive reduction
algorithm, which notices when there's already an equivalent indirect
dependency chain and removes the redundant edge.
We already have plenty of test coverage for these "count boundary" cases
in the context tests whose names start with TestContext2Plan_count and
TestContext2Apply_resourceCount, all of which continued to pass here
without any modification and so are not visible in the diff. The test
functions particularly relevant to this situation are:
- TestContext2Plan_countIncreaseFromNotSet
- TestContext2Plan_countDecreaseToOne
- TestContext2Plan_countOneIndex
- TestContext2Apply_countDecreaseToOneCorrupted
The last of those in particular deals with the situation where we have
both a no-key instance _and_ a zero-key instance in the prior state, which
is interesting here because to exercises an intentional interaction
between refactoring.ImpliedMoveStatements and refactoring.ApplyMoves,
where we intentionally generate an implied move statement that produces
a collision and then expect ApplyMoves to deal with it in the same way as
it would deal with all other collisions, and thus ensure we handle both
the explicit and implied collisions in the same way.
This does affect some UI-level tests, because a nice side-effect of this
new treatment of this old feature is that we can now report explicitly
in the UI that we're assigning new addresses to these objects, whereas
before we just said nothing and hoped the user would just guess what had
happened and why they therefore weren't seeing a diff.
The backend/local plan tests actually had a pre-existing bug where they
were using a state with a different instance key than the config called
for but getting away with it because we'd previously silently fix it up.
That's still fixed up, but now done with an explicit mention in the UI
and so I made the state consistent with the configuration here so that the
tests would be able to recognize _real_ differences where present, as
opposed to the errant difference caused by that inconsistency.
The set of drifted resources now includes move-only changes, where the
object value is identical but a move has been executed. In normal
operation, we previousl displayed these moves twice: once as part of
drift output, and once as part of planned changes.
As of this commit we omit move-only changes from drift display, except
for refresh-only plans. This fixes the redundant output.
Previously, drifted resources included only updates and deletes. To
correctly display the full changes which would result as part of a
refresh-only apply, the drifted resources must also include move-only
changes.
Rather than delaying resource drift detection until it is ready to be
presented, here we perform that computation after the plan walk has
completed. The resulting drift is represented like planned resource
changes, using a slice of ResourceInstanceChangeSrc values.
Because "moved" blocks produce changes that span across more than one
resource instance address at the same time, we need to take extra care
with them during planning.
The -target option allows for restricting Terraform's attention only to
a subset of resources when planning, as an escape hatch to recover from
bugs and mistakes.
However, we need to avoid any situation where only one "side" of a move
would be considered in a particular plan, because that'd create a new
situation that would be otherwise unreachable and would be difficult to
recover from.
As a compromise then, we'll reject an attempt to create a targeted plan if
the plan involves resolving a pending move and if the source address of
that move is not included in the targets.
Our error message offers the user two possible resolutions: to create an
untargeted plan, thus allowing everything to resolve, or to add additional
-target options to include just the existing resource instances that have
pending moves to resolve.
This compromise recognizes that it is possible -- though hopefully rare --
that a user could potentially both be recovering from a bug or mistake at
the same time as processing a move, if e.g. the bug was fixed by upgrading
a module and the new version includes a new "moved" block. In that edge
case, it might be necessary to just add the one additional address to
the targets rather than removing the targets altogether, if creating a
normal untargeted plan is impossible due to whatever bug they're trying to
recover from.
Previously our graph walker expected to recieve a data structure
containing schemas for all of the provider and provisioner plugins used in
the configuration and state. That made sense back when
terraform.NewContext was responsible for loading all of the schemas before
taking any other action, but it no longer has that responsiblity.
Instead, we'll now make sure that the "contextPlugins" object reaches all
of the locations where we need schema -- many of which already had access
to that object anyway -- and then load the needed schemas just in time.
The contextPlugins object memoizes schema lookups, so we can safely call
it many times with the same provider address or provisioner type name and
know that it'll still only load each distinct plugin once per Context
object.
As of this commit, the Context.Schemas method is now a public interface
only and not used by logic in the "terraform" package at all. However,
that does leave us in a rather tenuous situation of relying on the fact
that all practical users of terraform.Context end up calling "Schemas" at
some point in order to verify that we have all of the expected versions
of plugins. That's a non-obvious implicit dependency, and so in subsequent
commits we'll gradually move all responsibility for verifying plugin
versions into the caller of terraform.NewContext, which'll heal a
long-standing architectural wart whereby the caller is responsible for
installing and locating the plugin executables but not for verifying that
what's installed is conforming to the current configuration and dependency
lock file.
Previously the graph builders all expected to be given a full manifest
of all of the plugin component schemas that they could need during their
analysis work. That made sense when terraform.NewContext would always
proactively load all of the schemas before doing any other work, but we
now have a load-as-needed strategy for schemas.
We'll now have the graph builders use the contextPlugins object they each
already hold to retrieve individual schemas when needed. This avoids the
need to prepare a redundant data structure to pass alongside the
contextPlugins object, and leans on the memoization behavior inside
contextPlugins to preserve the old behavior of loading each provider's
schema only once.
In the v0.12 timeframe we made contextComponentFactory an interface with
the expectation that we'd write mocks of it for tests, but in practice we
ended up just always using the same "basicComponentFactory" implementation
throughout.
In the interests of simplification then, here we replace that interface
and its sole implementation with a new concrete struct type
contextPlugins.
Along with the general benefit that this removes an unneeded indirection,
this also means that we can add additional methods to the struct type
without the usual restriction that interface types prefer to be small.
In particular, in a future commit I'm planning to add methods for loading
provider and provisioner schemas, working with the currently-unused new
fields this commit has included in contextPlugins, as compared to its
predecessor basicComponentFactory.
Previously terraform.Context was built in an unfortunate way where all of
the data was provided up front in terraform.NewContext and then mutated
directly by subsequent operations. That made the data flow hard to follow,
commonly leading to bugs, and also meant that we were forced to take
various actions too early in terraform.NewContext, rather than waiting
until a more appropriate time during an operation.
This (enormous) commit changes terraform.Context so that its fields are
broadly just unchanging data about the execution context (current
workspace name, available plugins, etc) whereas the main data Terraform
works with arrives via individual method arguments and is returned in
return values.
Specifically, this means that terraform.Context no longer "has-a" config,
state, and "planned changes", instead holding on to those only temporarily
during an operation. The caller is responsible for propagating the outcome
of one step into the next step so that the data flow between operations is
actually visible.
However, since that's a change to the main entry points in the "terraform"
package, this commit also touches every file in the codebase which
interacted with those APIs. Most of the noise here is in updating tests
to take the same actions using the new API style, but this also affects
the main-code callers in the backends and in the command package.
My goal here was to refactor without changing observable behavior, but in
practice there are a couple externally-visible behavior variations here
that seemed okay in service of the broader goal:
- The "terraform graph" command is no longer hooked directly into the
core graph builders, because that's no longer part of the public API.
However, I did include a couple new Context functions whose contract
is to produce a UI-oriented graph, and _for now_ those continue to
return the physical graph we use for those operations. There's no
exported API for generating the "validate" and "eval" graphs, because
neither is particularly interesting in its own right, and so
"terraform graph" no longer supports those graph types.
- terraform.NewContext no longer has the responsibility for collecting
all of the provider schemas up front. Instead, we wait until we need
them. However, that means that some of our error messages now have a
slightly different shape due to unwinding through a differently-shaped
call stack. As of this commit we also end up reloading the schemas
multiple times in some cases, which is functionally acceptable but
likely represents a performance regression. I intend to rework this to
use caching, but I'm saving that for a later commit because this one is
big enough already.
The proximal reason for this change is to resolve the chicken/egg problem
whereby there was previously no single point where we could apply "moved"
statements to the previous run state before creating a plan. With this
change in place, we can now do that as part of Context.Plan, prior to
forking the input state into the three separate state artifacts we use
during planning.
However, this is at least the third project in a row where the previous
API design led to piling more functionality into terraform.NewContext and
then working around the incorrect order of operations that produces, so
I intend that by paying the cost/risk of this large diff now we can in
turn reduce the cost/risk of future projects that relate to our main
workflow actions.