2023-08-22 04:38:28 -05:00
|
|
|
# How OpenTF Uses Unicode
|
2021-06-02 19:46:44 -05:00
|
|
|
|
2023-08-22 04:38:28 -05:00
|
|
|
The OpenTF language uses the Unicode standards as the basis of various
|
2021-06-02 19:46:44 -05:00
|
|
|
different features. The Unicode Consortium publishes new versions of those
|
|
|
|
standards periodically, and we aim to adopt those new versions in new
|
2023-08-22 04:38:28 -05:00
|
|
|
minor releases of OpenTF in order to support additional characters added
|
2021-06-02 19:46:44 -05:00
|
|
|
in those new versions.
|
|
|
|
|
|
|
|
Unfortunately due to those features being implemented by relying on a number
|
|
|
|
of external libraries, adopting a new version of Unicode is not as simple as
|
|
|
|
just updating a version number somewhere. This document aims to describe the
|
2023-08-22 04:38:28 -05:00
|
|
|
various steps required to adopt a new version of Unicode in OpenTF.
|
2021-06-02 19:46:44 -05:00
|
|
|
|
|
|
|
We typically aim to be consistent across all of these dependencies as to which
|
|
|
|
major version of Unicode we currently conform to. The usual initial driver
|
|
|
|
for a Unicode upgrade is switching to new version of the Go runtime library
|
|
|
|
which itself uses a new version of Unicode, because Go itself does not provide
|
|
|
|
any way to select Unicode versions independently from Go versions. Therefore
|
|
|
|
we typically upgrade to a new Unicode version only in conjunction with
|
|
|
|
upgrading to a new Go version.
|
|
|
|
|
|
|
|
## Unicode tables in the Go standard library
|
|
|
|
|
2023-08-22 04:38:28 -05:00
|
|
|
Several OpenTF language features are implemented in terms of functions in
|
2021-06-02 19:46:44 -05:00
|
|
|
[the Go `strings` package](https://pkg.go.dev/strings),
|
|
|
|
[the Go `unicode` package](https://pkg.go.dev/unicode), and other supporting
|
|
|
|
packages in the Go standard library.
|
|
|
|
|
|
|
|
The Go team maintains the Go standard library features to support a particular
|
|
|
|
Unicode version for each Go version. The specific Unicode version for a
|
|
|
|
particular Go version is available in
|
|
|
|
[`unicode.Version`](https://pkg.go.dev/unicode#Version).
|
|
|
|
|
|
|
|
We adopt a new version of Go by editing the `.go-version` file in the root
|
2023-08-22 04:38:28 -05:00
|
|
|
of this repository. Although it's typically possible to build OpenTF with
|
2021-06-02 19:46:44 -05:00
|
|
|
other versions of Go, that file documents the version we intend to use for
|
|
|
|
official releases and thus the primary version we use for development and
|
|
|
|
testing. Adopting a new Go version typically also implies other behavior
|
|
|
|
changes inherited from the Go standard library, so it's important to review the
|
|
|
|
relevant version changelog(s) to note any behavior changes we'll need to pass
|
2023-08-22 04:38:28 -05:00
|
|
|
on to our own users via the OpenTF changelog.
|
2021-06-02 19:46:44 -05:00
|
|
|
|
|
|
|
The other subsystems described below should always be set up to match
|
|
|
|
`unicode.Version`. In some cases those libraries automatically try to align
|
|
|
|
themselves with `unicode.Version` and generate an error if they cannot, but
|
|
|
|
that isn't true of all of them.
|
|
|
|
|
2022-11-16 11:20:03 -06:00
|
|
|
## Unicode Identifier Rules in HCL
|
|
|
|
|
|
|
|
_Identifier and Pattern Syntax_ (TF31) is a Unicode standards annex which
|
|
|
|
describe a set of rules for tokenizing "identifiers", such as variable names
|
|
|
|
in a programming language.
|
|
|
|
|
|
|
|
HCL uses a superset of that specification for its own identifier tokenization
|
|
|
|
rules, and so it includes some code derived from the TF31 data tables that
|
|
|
|
describe which characters belong to the "ID_Start" and "ID_Continue" classes.
|
|
|
|
|
2023-08-22 04:38:28 -05:00
|
|
|
Since OpenTF is the primary user of HCL, it's typically OpenTF's adoption
|
2022-11-16 11:20:03 -06:00
|
|
|
of a new Unicode version which drives HCL to adopt one. To update the Unicode
|
|
|
|
tables to a new version:
|
|
|
|
* Edit `hclsyntax/generate.go`'s line which runs `unicode2ragel.rb` to specify
|
|
|
|
the URL of the `DerivedCoreProperties.txt` data file for the intended Unicode
|
|
|
|
version.
|
|
|
|
* Run `go generate ./hclsyntax` to run the generation code to update both
|
|
|
|
`unicode_derived.rl` and, indirectly, `scan_tokens.go`. (You will need both
|
|
|
|
a Ruby interpreter and the Ragel state machine compiler on your system in
|
|
|
|
order to complete this step.)
|
|
|
|
* Run all the tests to check for regressions: `go test ./...`
|
|
|
|
* If all looks good, commit all of the changes and open a PR to HCL.
|
2023-08-22 04:38:28 -05:00
|
|
|
* Once that PR is merged and released, update OpenTF to use the new version
|
2022-11-16 11:20:03 -06:00
|
|
|
of HCL.
|
|
|
|
|
2021-06-02 19:46:44 -05:00
|
|
|
## Unicode Text Segmentation
|
|
|
|
|
|
|
|
_Text Segmentation_ (TR29) is a Unicode standards annex which describes
|
|
|
|
algorithms for breaking strings into smaller units such as sentences, words,
|
|
|
|
and grapheme clusters.
|
|
|
|
|
2023-08-22 04:38:28 -05:00
|
|
|
Several OpenTF language features make use of the _grapheme cluster_
|
2021-06-02 19:46:44 -05:00
|
|
|
algorithm in particular, because it provides a practical definition of
|
|
|
|
individual visible characters, taking into account combining sequences such
|
|
|
|
as Latin letters with separate diacritics or Emoji characters with gender
|
|
|
|
presentation and skin tone modifiers.
|
|
|
|
|
|
|
|
The text segmentation algorithms rely on supplementary data tables that are
|
|
|
|
not part of the core set encoded in the Go standard library's `unicode`
|
|
|
|
packages, and so instead we rely on the third-party module
|
|
|
|
[`github.com/apparentlymart/go-textseg`](http://pkg.go.dev/github.com/apparentlymart/go-textseg)
|
|
|
|
to provide those tables and a Go implementation of the grapheme cluster
|
|
|
|
segmentation algorithm in terms of the tables.
|
|
|
|
|
|
|
|
The `go-textseg` library is designed to allow calling programs to potentially
|
|
|
|
support multiple Unicode versions at once, by offering a separate module major
|
|
|
|
version for each Unicode major version. For example, the full module path for
|
|
|
|
the Unicode 13 implementation is `github.com/apparentlymart/go-textseg/v13`.
|
|
|
|
|
|
|
|
If that external library doesn't yet have support for the Unicode version we
|
|
|
|
intend to adopt then we'll first need to open a pull request to contribute
|
|
|
|
new language support. The details of how to do this will unfortunately vary
|
|
|
|
depending on how significantly the Text Segmentation annex has changed since
|
|
|
|
the most recently-supported Unicode version, but in many cases it can be
|
|
|
|
just a matter of editing that library's `make_tables.go`, `make_test_tables.go`,
|
|
|
|
and `generate.go` files to point to the URLs where the Unicode consortium
|
|
|
|
published new tables and then run `go generate` to rebuild the files derived
|
|
|
|
from those data sources. As long as the new Unicode version has only changed
|
|
|
|
the data tables and not also changed the algorithm, often no further changes
|
|
|
|
are needed.
|
|
|
|
|
|
|
|
Once a new Unicode version is included, the maintainer of that library will
|
|
|
|
typically publish a new major version that we can depend on. Two different
|
2023-08-22 04:38:28 -05:00
|
|
|
codebases included in OpenTF all depend directly on the `go-textseg` module
|
2021-06-02 19:46:44 -05:00
|
|
|
for parts of their functionality:
|
|
|
|
|
|
|
|
* [`hashicorp/hcl`](https://github.com/hashicorp/hcl) uses text
|
|
|
|
segmentation as part of producing visual column offsets in source ranges
|
2023-08-22 04:38:28 -05:00
|
|
|
returned by the tokenizer and parser. OpenTF in turn uses that library
|
|
|
|
for the underlying syntax of the OpenTF language, and so it passes on
|
2021-06-02 19:46:44 -05:00
|
|
|
those source ranges to the end-user as part of diagnostic messages.
|
|
|
|
* The third-party module [`github.com/zclconf/go-cty`](https://github.com/zclconf/go-cty)
|
2023-08-22 04:38:28 -05:00
|
|
|
provides several of the OpenTF language built in functions, including
|
2021-06-02 19:46:44 -05:00
|
|
|
functions like `substr` and `length` which need to count grapheme clusters
|
|
|
|
as part of their implementation.
|
|
|
|
|
2023-08-22 04:38:28 -05:00
|
|
|
As part of upgrading OpenTF's Unicode support we therefore typically also
|
2021-06-02 19:46:44 -05:00
|
|
|
open pull requests against these other codebases, and then adopt the new
|
2023-08-22 04:38:28 -05:00
|
|
|
versions that produces. OpenTF work often drives the adoption of new Unicode
|
2021-06-02 19:46:44 -05:00
|
|
|
versions in those codebases, with other dependencies following along when they
|
|
|
|
next upgrade.
|
|
|
|
|
2023-08-22 04:38:28 -05:00
|
|
|
At the time of writing OpenTF itself doesn't _directly_ depend on
|
|
|
|
`go-textseg`, and so there are no specific changes required in this OpenTF
|
2021-06-02 19:46:44 -05:00
|
|
|
codebase aside from the `go.sum` file update that always follows from
|
|
|
|
changes to transitive dependencies.
|
|
|
|
|
|
|
|
The `go-textseg` library does have a different "auto-version" mechanism which
|
|
|
|
selects an appropriate module version based on the current Go language version,
|
|
|
|
but neither HCL nor cty use that because the auto-version package will not
|
|
|
|
compile for any Go version that doesn't have a corresponding Unicode version
|
|
|
|
explicitly recorded in that repository, and so that would be too harsh a
|
|
|
|
constraint for libraries like HCL which have many callers, many of which don't
|
|
|
|
care strongly about Unicode support, that may wish to upgrade Go before the
|
|
|
|
text segmentation library has been updated.
|