opentofu/docs/unicode.md

# How Terraform Uses Unicode

The Terraform language uses the Unicode standards as the basis of various
different features. The Unicode Consortium publishes new versions of those
standards periodically, and we aim to adopt those new versions in new
minor releases of Terraform in order to support additional characters added
in those new versions.

Unfortunately due to those features being implemented by relying on a number
of external libraries, adopting a new version of Unicode is not as simple as
just updating a version number somewhere. This document aims to describe the
various steps required to adopt a new version of Unicode in Terraform.

We typically aim to be consistent across all of these dependencies as to which
major version of Unicode we currently conform to. The usual initial driver
for a Unicode upgrade is switching to new version of the Go runtime library
which itself uses a new version of Unicode, because Go itself does not provide
any way to select Unicode versions independently from Go versions. Therefore
we typically upgrade to a new Unicode version only in conjunction with
upgrading to a new Go version.

## Unicode tables in the Go standard library

Several Terraform language features are implemented in terms of functions in
[the Go `strings` package](https://pkg.go.dev/strings),
[the Go `unicode` package](https://pkg.go.dev/unicode), and other supporting
packages in the Go standard library.

The Go team maintains the Go standard library features to support a particular
Unicode version for each Go version. The specific Unicode version for a
particular Go version is available in
[`unicode.Version`](https://pkg.go.dev/unicode#Version).

We adopt a new version of Go by editing the `.go-version` file in the root
of this repository. Although it's typically possible to build Terraform with
other versions of Go, that file documents the version we intend to use for
official releases and thus the primary version we use for development and
testing. Adopting a new Go version typically also implies other behavior
changes inherited from the Go standard library, so it's important to review the
relevant version changelog(s) to note any behavior changes we'll need to pass
on to our own users via the Terraform changelog.

The other subsystems described below should always be set up to match
`unicode.Version`. In some cases those libraries automatically try to align
themselves with `unicode.Version` and generate an error if they cannot, but
that isn't true of all of them.

## Unicode Text Segmentation

_Text Segmentation_ (TR29) is a Unicode standards annex which describes
algorithms for breaking strings into smaller units such as sentences, words,
and grapheme clusters.

Several Terraform language features make use of the _grapheme cluster_
algorithm in particular, because it provides a practical definition of
individual visible characters, taking into account combining sequences such
as Latin letters with separate diacritics or Emoji characters with gender
presentation and skin tone modifiers.

The text segmentation algorithms rely on supplementary data tables that are
not part of the core set encoded in the Go standard library's `unicode`
packages, and so instead we rely on the third-party module
[`github.com/apparentlymart/go-textseg`](http://pkg.go.dev/github.com/apparentlymart/go-textseg)
to provide those tables and a Go implementation of the grapheme cluster
segmentation algorithm in terms of the tables.

The `go-textseg` library is designed to allow calling programs to potentially
support multiple Unicode versions at once, by offering a separate module major
version for each Unicode major version. For example, the full module path for
the Unicode 13 implementation is `github.com/apparentlymart/go-textseg/v13`.

If that external library doesn't yet have support for the Unicode version we
intend to adopt then we'll first need to open a pull request to contribute
new language support. The details of how to do this will unfortunately vary
depending on how significantly the Text Segmentation annex has changed since
the most recently-supported Unicode version, but in many cases it can be
just a matter of editing that library's `make_tables.go`, `make_test_tables.go`,
and `generate.go` files to point to the URLs where the Unicode consortium
published new tables and then run `go generate` to rebuild the files derived
from those data sources. As long as the new Unicode version has only changed
the data tables and not also changed the algorithm, often no further changes
are needed.

Once a new Unicode version is included, the maintainer of that library will
typically publish a new major version that we can depend on. Two different
codebases included in Terraform all depend directly on the `go-textseg` module
for parts of their functionality:

* [`hashicorp/hcl`](https://github.com/hashicorp/hcl) uses text
  segmentation as part of producing visual column offsets in source ranges
  returned by the tokenizer and parser. Terraform in turn uses that library
  for the underlying syntax of the Terraform language, and so it passes on
  those source ranges to the end-user as part of diagnostic messages.
* The third-party module [`github.com/zclconf/go-cty`](https://github.com/zclconf/go-cty)
  provides several of the Terraform language built in functions, including
  functions like `substr` and `length` which need to count grapheme clusters
  as part of their implementation.

As part of upgrading Terraform's Unicode support we therefore typically also
open pull requests against these other codebases, and then adopt the new
versions that produces. Terraform work often drives the adoption of new Unicode
versions in those codebases, with other dependencies following along when they
next upgrade.

At the time of writing Terraform itself doesn't _directly_ depend on
`go-textseg`, and so there are no specific changes required in this Terraform
codebase aside from the `go.sum` file update that always follows from
changes to transitive dependencies.

The `go-textseg` library does have a different "auto-version" mechanism which
selects an appropriate module version based on the current Go language version,
but neither HCL nor cty use that because the auto-version package will not
compile for any Go version that doesn't have a corresponding Unicode version
explicitly recorded in that repository, and so that would be too harsh a
constraint for libraries like HCL which have many callers, many of which don't
care strongly about Unicode support, that may wish to upgrade Go before the
text segmentation library has been updated.
docs: Some notes on adopting new Unicode versions Previously this is a task that I've just taken on using my own knowledge of how these parts fit together, but that's not at all a sustainable situation, and so here I've just documented the steps I've followed in for previous Unicode version upgrades, and thus which I'd expect to follow for future Unicode version upgrades too. I hope this is also a somewhat-interesting overview of how Terraform makes use of Unicode, even for folks who are just working on the other relevant features of Terraform and not specifically trying to change Terraform's Unicode support. Since I'm the primary maintainer of two of the dependencies involved in this it seems likely that following this process will still involve me in _some_ capacity, but hopefully only necessarily in the form of reviewing PRs and cutting new releases of those libraries. 2021-06-02 19:46:44 -05:00			`# How Terraform Uses Unicode`

			`The Terraform language uses the Unicode standards as the basis of various`
			`different features. The Unicode Consortium publishes new versions of those`
			`standards periodically, and we aim to adopt those new versions in new`
			`minor releases of Terraform in order to support additional characters added`
			`in those new versions.`

			`Unfortunately due to those features being implemented by relying on a number`
			`of external libraries, adopting a new version of Unicode is not as simple as`
			`just updating a version number somewhere. This document aims to describe the`
			`various steps required to adopt a new version of Unicode in Terraform.`

			`We typically aim to be consistent across all of these dependencies as to which`
			`major version of Unicode we currently conform to. The usual initial driver`
			`for a Unicode upgrade is switching to new version of the Go runtime library`
			`which itself uses a new version of Unicode, because Go itself does not provide`
			`any way to select Unicode versions independently from Go versions. Therefore`
			`we typically upgrade to a new Unicode version only in conjunction with`
			`upgrading to a new Go version.`

			`## Unicode tables in the Go standard library`

			`Several Terraform language features are implemented in terms of functions in`
			[the Go `strings` package](https://pkg.go.dev/strings),
			[the Go `unicode` package](https://pkg.go.dev/unicode), and other supporting
			`packages in the Go standard library.`

			`The Go team maintains the Go standard library features to support a particular`
			`Unicode version for each Go version. The specific Unicode version for a`
			`particular Go version is available in`
			[`unicode.Version`](https://pkg.go.dev/unicode#Version).

			We adopt a new version of Go by editing the `.go-version` file in the root
			`of this repository. Although it's typically possible to build Terraform with`
			`other versions of Go, that file documents the version we intend to use for`
			`official releases and thus the primary version we use for development and`
			`testing. Adopting a new Go version typically also implies other behavior`
			`changes inherited from the Go standard library, so it's important to review the`
			`relevant version changelog(s) to note any behavior changes we'll need to pass`
			`on to our own users via the Terraform changelog.`

			`The other subsystems described below should always be set up to match`
			`unicode.Version`. In some cases those libraries automatically try to align
			themselves with `unicode.Version` and generate an error if they cannot, but
			`that isn't true of all of them.`

			`## Unicode Text Segmentation`

			`_Text Segmentation_ (TR29) is a Unicode standards annex which describes`
			`algorithms for breaking strings into smaller units such as sentences, words,`
			`and grapheme clusters.`

			`Several Terraform language features make use of the _grapheme cluster_`
			`algorithm in particular, because it provides a practical definition of`
			`individual visible characters, taking into account combining sequences such`
			`as Latin letters with separate diacritics or Emoji characters with gender`
			`presentation and skin tone modifiers.`

			`The text segmentation algorithms rely on supplementary data tables that are`
			not part of the core set encoded in the Go standard library's `unicode`
			`packages, and so instead we rely on the third-party module`
			[`github.com/apparentlymart/go-textseg`](http://pkg.go.dev/github.com/apparentlymart/go-textseg)
			`to provide those tables and a Go implementation of the grapheme cluster`
			`segmentation algorithm in terms of the tables.`

			The `go-textseg` library is designed to allow calling programs to potentially
			`support multiple Unicode versions at once, by offering a separate module major`
			`version for each Unicode major version. For example, the full module path for`
			the Unicode 13 implementation is `github.com/apparentlymart/go-textseg/v13`.

			`If that external library doesn't yet have support for the Unicode version we`
			`intend to adopt then we'll first need to open a pull request to contribute`
			`new language support. The details of how to do this will unfortunately vary`
			`depending on how significantly the Text Segmentation annex has changed since`
			`the most recently-supported Unicode version, but in many cases it can be`
			just a matter of editing that library's `make_tables.go`, `make_test_tables.go`,
			and `generate.go` files to point to the URLs where the Unicode consortium
			published new tables and then run `go generate` to rebuild the files derived
			`from those data sources. As long as the new Unicode version has only changed`
			`the data tables and not also changed the algorithm, often no further changes`
			`are needed.`

			`Once a new Unicode version is included, the maintainer of that library will`
			`typically publish a new major version that we can depend on. Two different`
			codebases included in Terraform all depend directly on the `go-textseg` module
			`for parts of their functionality:`

			* [`hashicorp/hcl`](https://github.com/hashicorp/hcl) uses text
			`segmentation as part of producing visual column offsets in source ranges`
			`returned by the tokenizer and parser. Terraform in turn uses that library`
			`for the underlying syntax of the Terraform language, and so it passes on`
			`those source ranges to the end-user as part of diagnostic messages.`
			* The third-party module [`github.com/zclconf/go-cty`](https://github.com/zclconf/go-cty)
			`provides several of the Terraform language built in functions, including`
			functions like `substr` and `length` which need to count grapheme clusters
			`as part of their implementation.`

			`As part of upgrading Terraform's Unicode support we therefore typically also`
			`open pull requests against these other codebases, and then adopt the new`
			`versions that produces. Terraform work often drives the adoption of new Unicode`
			`versions in those codebases, with other dependencies following along when they`
			`next upgrade.`

			`At the time of writing Terraform itself doesn't _directly_ depend on`
			`go-textseg`, and so there are no specific changes required in this Terraform
			codebase aside from the `go.sum` file update that always follows from
			`changes to transitive dependencies.`

			The `go-textseg` library does have a different "auto-version" mechanism which
			`selects an appropriate module version based on the current Go language version,`
			`but neither HCL nor cty use that because the auto-version package will not`
			`compile for any Go version that doesn't have a corresponding Unicode version`
			`explicitly recorded in that repository, and so that would be too harsh a`
			`constraint for libraries like HCL which have many callers, many of which don't`
			`care strongly about Unicode support, that may wish to upgrade Go before the`
			`text segmentation library has been updated.`