There are some security and interoperability issues with PEP 383 (http://python.org/dev/peps/pep-0383/
), as outlined below.
While it appears that the security issues may not be problematic in Python, if used as intended, we should present these issues publicly in some way so that people using similar systems are made aware of the problems.
There is a known problem with file systems that use a legacy charset. When you use a Unicode API to find the files in a directory, you typically get back a list of Unicode file names. You use those names to access the files through some other API. There are two possible problems:
- One of the file names is invalid according to the legacy charset converter. For example, it is an SJIS string consisting of bytes <E0 30>.
- Two of the file names are mapped to the same Unicode string by the legacy charset converter.
These problems come up in other situations besides file systems as well. A common source is when a byte string that is valid in one charset is converted by a different charset's converter. For example, the byte string <E0 30> that is invalid in SJIS is perfectly meaningful in Latin-1, representing "à0".
A possible solution to this is to enable all charset converters to losslessly (reversibly) convert to Unicode. That is, any sequence of bytes can be converted by each charset converter to a Unicode string, and that Unicode string will be converted back to exactly that original sequence of bytes by that converter. This precludes, for example, the charset converter's mapping two different unmappable
byte sequences to
( � ) REPLACEMENT CHARACTER, since the original bytes could not be recovered. It also precludes having "fallbacks" (see http://unicode.org/reports/tr22/
): cases where two different byte sequences map to the same Unicode sequence.
PEP 383 Approach
The basic idea of PEP 383 is to be able to do this by converting all "unmappable" sequences to a sequence of one or more isolated high surrogate code points: that is, each code point's value is 0xD800 plus the corresponding unmappable byte value. With this mechanism, every maximal subsequence of bytes that can be reversibly mapped
to Unicode by the charset converter is so mapped; any intervening subsequences are converted to a sequence of high surrogates. The result is a Unicode String, but is not a well-formed UTF sequence.
For example, suppose that the byte 81 is illegal in charset n
. When converted to Unicode, PEP 383 represents this as U+D881. When mapped back to bytes (for charset n
), then that turns back into the byte 81. This allows the source byte sequence to be reversibly represented in a Unicode String, no matter what the contents. If this mechanism is applied to a charset converter that has no
fallbacks from bytes to Unicode, then the charset converter becomes reversible (from bytes to
Unicode to bytes).
Note that this only works when the Unicode string is converted back with the very same charset converter that was used to convert from bytes. For more information on PEP 383, see (http://python.org/dev/peps/pep-0383/
- B2Un is the bytes-to-Unicode converter for charset n
- U2Bn is the Unicode-to-bytes converter for charset n
- An invalid byte is one that would be mapped by a PEP to a high surrogate, because it is (or is part of a sequence that is) not reversibly mappable. Note that the context of the byte is important: 81 alone might be unmappable, while 81 followed by an 40 is valid (see http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003)
Unicode implementations have been subject to a number of security exploits (such as http://blogs.technet.com/srd/archive/2009/05/18/more-information-about-the-iis-authentication-bypass.aspx
) centered around ill-formed encoding. Systems making incorrect use of a PEP 383 style mechanism are subject to such an attack.
Suppose the source byte stream is <A B X D>, and that according to the charset converter being used (n), X is an invalid byte. B2Un transforms the byte stream into Unicode as <G Y H>, where Y is an isolated surrogate. U2Bn maps back to the correct original <A B X D>. That is the intended usage of PEP 383.
The problem comes when that Unicode sequence is converted back to bytes by a different charset converter m. Suppose that U2Bm maps Y into a valid byte representing "/", or any one of a number of other security-sensitive characters. That means that converting <G Y H> via U2Bm to bytes, and back to Unicode results in the string "G/Y", where the "/" did not exist in the original.
This violates one of the cardinal security rules for transformations of Unicode strings: creating a character where no valid character previously existed. This was, for example, at the heart of the "non-shortest form" security exploits. A gatekeeper is watching for suspicious characters. It doesn't see Y as one of them, but past the gatekeeper, a conversion of U2Bm followed by B2Um results in a suspicious character where none previously existed.
The suggested solution for this is that a converter can only map an isolated surrogate Y onto a byte stream when the resulting byte would be an illegal
byte. If not, then an exception would be thrown, or a replacement byte or byte sequence must be used instead (such as the SUB character). For details, see Safely Converting to Bytes
, below. This replacement would be similar to what is used when trying to convert a Unicode character that cannot be represented in the target encoding. That preserves the ability to round-trip when the same encoding is used, but prevents security attacks. Note that simply not represented (deleting) Y in the output is not an option, since that is also open to security exploits.
It appears that PEP 383 when used as intended
in Python is unlikely to present security problems. According to information from the author:
- PEP 383 is only intended for use with ASCII-based charsets.
- Only bytes >= 128 will be transformed to D8xx or back.
- The combination of these factors means that no ASCII-repertoire characters (which represent the most serious problems for security) would ever be generated.
- The primary use of PEP 383 is in file systems, and where the Unicode resulting from PEP 383 is only ever converted back to bytes on the same system, using the same charset converter.
However, if PEP 383 is used more generally by applications, or similar systems are used more generally, security exploits are possible. Developers of charset converters should also be aware of other possible security issues, so that they avoid shortest-form exploits and others, and should consult UTR #36: Unicode Security
for more information.
The choice of isolated surrogates (D8xx) as the way to represent the unconvertible bytes appears clever at first glance. However, it presents certain interoperability and security issue. Such isolated surrogates are not well formed. Although they can be represented in Unicode Strings, they are not supported by conformant UTF-8, UTF-16, or UTF-32 converters or implementations. This may cause interoperability problems, since many systems replace incoming ill-formed Unicode sequences by replacement characters.
It may also cause security problems. Although strongly discouraged for security reasons, some implementations may delete the isolated surrogates, which can cause a security problem when two substrings become adjacent that were previously separated.
There are different alternatives:
- Use 256 private use code points, somewhere in the ranges F0000..FFFFD or 100000..10FFFD. This would probably cause the fewest security and interoperability problems. There is some possibility of collision with other uses of private use characters.
- Use pairs of non-character code points in the range FDD0..FDEF. These are "super" private use characters, and are discouraged for general interchange. The transformation would take each nibble of a byte Y, and add to FDD0 and FDE0, respectively. However, noncharacter code points may be replaced by
U+FFFD ( � ) REPLACEMENT CHARACTER by some implementations, especially when they use them
internally. (Again, incoming characters must never be deleted, since that can cause security problems.)
- One could also ask the Unicode Consortium to encode characters expressly for that purpose. That would be essentially 256 different versions of
U+FFFD ( � ) REPLACEMENT CHARACTER. The downside of this approach is that even if it were accepted, it would take a couple of years to get into the standard.
Safely Converting to Bytes
The following describes how to safely convert a Unicode buffer U1 to a byte buffer B1 when the D8xx convention is used. It assumes that an exception is thrown if a D8xx is problematic. It can be enhanced to use substitution characters instead, if needed.
The simplest mechanism is just by brute force:
- Convert from Unicode buffer U1 to byte buffer B1.
- If there were any D8XX's in U1
- Convert back to Unicode buffer U2 (according to the same Charset C1)
- If U1 != U2, throw an exception.
Because the frequency of D8xx's will be very low, this approach is probably enough for many implementations.
There are a number of different ways to optimize this. Such approaches may vary depending on whether the converter is stateless or stateful.
- At any point at which the converter sees a D8xx, traverse the following D8xx's (up to a non-D8xx or EOS), putting each low byte into a buffer B2.
- Convert enough extra input bytes to B2 to provide for context (basically the maximum bytes per character minus 1).
- So <D841 D842 XX> => <41 42 YY>
- Convert B2 back to Unicode buffer U3 with the same converter, using an incremental conversion so that extra bytes at the end don't cause an error.
- If U3 consists of all D8xx's with low bytes matching B2, then append B2 (as far as you got) to B1 and continue.
- This will be the normal -- and unproblematic -- case.
- If not, throw an exception.
This is basically the same as Stateless, except for the conversion of B2 to U3, which has the following changes
- Scan backwards in the output byte buffer, and pick up the last shift sequence of each kind.
- For simple stateful encodings like 2022-JP, this will be just the last shift sequence (G0)
- For more complex stateful encodings like 2022-CN, this will be until we get a shift sequence for each of G0, G1, G2
- Create a new buffer B2a, which is B2 prepended with those shift sequences.
- Convert B2a to U3
More Optimized Stateless
In building the B2Un conversion table generate the following data and store:
Safe: xx is never part of any valid character in charset n.
Unsafe: xx is always part of some valid character in charset n.
Mixed: anything else; that is, safe or unsafe depending on context.
All single-byte charsets would
only have safe or unsafe bytes, so they are easy. The only ones that require more work are the mixed bytes, which
only occur in multi-byte charsets. Let's take SJIS. In the sequence
<D881 0030> it is safe to map the D881 back to 81, because <81
30> would map to <D881 0030>. But in the sequence <D881
0040> it is not, because <81 40> would map back to
Once this is done, the Stateless algorithm is modified to just map the Safe D8xx's, throw an exception on the Unsafe D8xx's, and use the plain Stateful process on any sequence of Mixed D8xx's.
More Optimized Stateful
- Instead of traversing backwards, record the state of the U2B converter, and use it for the B2U conversion of B2.
Further optimizations are also possible, for both Stateless and Stateful charset processes.
Mark Davis, 2009-05-29. Last updated 07-07