Bidi in URLs

There has been some discussion of having a special ordering for BIDI URLs so that they are more understandable to users. (I'll use URL in the broad sense, as including non-ASCII characters.) This is a complicated issue, and I can't claim to have all the answers, but here are some thoughts on the issue. 

(These come from email, and are somewhat clumsily stitched together. If I get some more time I'll try to make it more cohesive.)

Ideally labels in a URL would be in a consistent order. By "label", I mean in a broad sense, so each of the three letter sequences below counts as a label:
In the Unicode Consortium, we've been aware of this issue, and have considered options a number of times over the years. However, we have not yet heard a good case for how supporting uniform field direction in URLs can be done without significant compatibility and security problems. There are some big stumbling blocks:
  • Many clients that display URLs will either not be URL aware, or not be aware of the latest standard, or not be able to parse out text as definitively belonging to a URL.
  • The specs have no termination criteria for parsing URLs in plain text. 
    • So http://abc.def#ghi could be either of the following, since fragments can include spaces:
    • (And in languages that don't use spaces to separate words, this is further complicated.) Different applications have different heuristics for this, but those heuristics don't always agree.
  • Many applications heuristically recognize fragment URLs, like "google.com". So in a broad sense, people understand a URL as "something that I could paste into an address bar in my browser and will get me to a page", and have the expectation that they will order similarly. That is, ordering "GOOGLE.COM" one way and "http://GOOGLE.COM" another would be confusing. Then the issue of parsing in plain text becomes even more challenging.
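To make the termination problem concrete, here is a minimal sketch in Python of two link-detection heuristics disagreeing on where the same URL ends. Both regular expressions are hypothetical, made up for illustration; neither comes from a spec or from any particular application.

```python
import re

# Hypothetical heuristic A: a URL ends at the first whitespace.
STOP_AT_SPACE = re.compile(r"https?://\S+")

# Hypothetical heuristic B: spaces are allowed inside the fragment,
# which runs until the first full stop after the '#'.
SPACES_IN_FRAGMENT = re.compile(r"https?://[^\s#]+(?:#[^.]*)?")


def find_url(pattern, text):
    """Return the first substring the heuristic treats as a URL, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


text = "see http://abc.def#ghi jkl. More text follows."
url_a = find_url(STOP_AT_SPACE, text)
url_b = find_url(SPACES_IN_FRAGMENT, text)

print(repr(url_a))
print(repr(url_b))
```

The two heuristics pick different end points for the same text, so any display rule that depends on knowing where the URL ends would behave differently in the two applications.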
Why is ordering a problem? Suppose I have the URL http://ABC.DEF (where uppercase stands for RTL letters). Currently, any application that displays BIDI will do it as either http://FED.CBA (in a LTR environment) or FED.CBA://http in a RTL one. If an application starts to display it as http://CBA.FED, then it represents a significant security problem, since the user will think it is the different URL http://DEF.ABC. As long as there is a significant percentage of old applications, there will be the opportunity for that problem. The same goes for LTR URLs in a RTL environment.
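The collision can be sketched with a toy model in Python. This is not the Unicode Bidirectional Algorithm: it assumes an LTR paragraph, treats uppercase ASCII letters as stand-ins for RTL letters and lowercase as LTR, and ignores digits and other weak types. The function names are made up for illustration.

```python
import re

# An RTL run starts and ends with an RTL (uppercase) letter and may contain
# neutrals such as '.' in between, but no LTR (lowercase) letters.
RTL_RUN = re.compile(r"[A-Z][^a-z]*[A-Z]|[A-Z]")


def display_standard_ltr(logical):
    """Toy version of today's display: each whole RTL run is reversed."""
    return RTL_RUN.sub(lambda m: m.group(0)[::-1], logical)


def display_uniform_ltr(logical):
    """Toy version of a 'uniform label order' display: labels stay in
    left-to-right order and only the letters inside each label reverse."""
    return re.sub(r"[A-Z]+", lambda m: m.group(0)[::-1], logical)


old_app = display_standard_ltr("http://ABC.DEF")   # existing behaviour
new_app = display_uniform_ltr("http://ABC.DEF")    # proposed behaviour
spoof = display_standard_ltr("http://DEF.ABC")     # a different URL, old app

print(old_app, new_app, spoof)
```

In this model the new-style display of http://ABC.DEF is character-for-character identical to the old-style display of http://DEF.ABC, which is exactly the confusion described above.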

Moreover, if I paste text between applications, even where the paragraph direction is constant, then the labels can flip in arbitrary ways if some applications support uniform direction and some don't. The challenge is to get all applications to consistently (a) be URL aware, and (b) all switch to some new display order in unison. It might be that someone can come up with a way to handle this, but we haven't heard of one yet.

(Had the importance of URL syntax been known at the time the consortium came up with the BIDI algorithm, and were the IRI syntax deterministic enough that the termination could always be recognized, even in the midst of plain text, we'd be in a different world.)

But we're not. That leaves a few more or less unpalatable alternatives:

All RTL. Any significant site that wants to support BIDI languages should provide for the ability to have IRIs with all RTL characters: host name, path, query, fragment. If all the pieces are RTL text (or infixed neutrals), then the display has a consistent direction in both RTL and LTR environments, no matter whether the application is URL-aware or not, and users won't be confused. Now that the TLD can be RTL, I think there will be pressure for the sites to do that, since completely-RTL IRIs will work much better in all environments in any event.
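A site could check its own IRIs for this property. The following is a minimal sketch in Python, assuming that the scheme is plain ASCII and can be ignored, and that "all RTL" means "at least one strong RTL character and no strong LTR character". The function name is hypothetical. Digits are weak types and are not flagged by this check.

```python
import unicodedata


def is_all_rtl(iri):
    """True if everything after the scheme is RTL text or neutrals."""
    rest = iri.split("://", 1)[-1]  # assumption: skip the ASCII scheme
    classes = [unicodedata.bidirectional(ch) for ch in rest]
    has_rtl = any(c in ("R", "AL") for c in classes)  # Hebrew, Arabic, ...
    has_ltr = any(c == "L" for c in classes)          # Latin, Greek, ...
    return has_rtl and not has_ltr


# Hebrew letters are written as escapes so the source stays ASCII.
all_hebrew = "http://\u05d0\u05d1\u05d2.\u05d3\u05d4\u05d5/\u05d6\u05d7"
mixed = "http://\u05d0\u05d1\u05d2.com/index.html"

print(is_all_rtl(all_hebrew), is_all_rtl(mixed))
```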

Shawn raised the issue of .html. As I think about it, there are a couple of ways to deal with this. First, even currently servers don't need to use those suffixes: http://unicode.org/reports/ doesn't contain a .html. Secondly, we could establish equivalences for some Hebrew and Arabic-script suffixes to take the place of the common ones. The scheme is also an issue; the IRI is still understandable (though ugly) if it has to be ASCII, but it would be somewhat better if it could have a RTL alias. (Pure digit fields like IP addresses are a bit ugly, but seldom used.)

The % is an issue, although in an ideal world its use would be minimized in what the user sees. Although the characters have to be %-encoded or punycoded to go over the web, they can be restored for display to the user. That is, a % would only occur in a label where a character has to be quoted so that it does not terminate the label. We can discuss how to handle the cases where they cannot be minimized: how sites can work around it, whether the remaining cases represent a significant problem, and if so, whether there is some alternative syntax that could be used.

Limited Markup. Another alternative would be to use a limited set of markup within URLs so as to preserve the right ordering. It would suffice to allow RLM and LRM characters around the neutral characters. Any BIDI URL could be normalized by a compliant implementation so as to include these characters in all and only the right places. And once this was done, the text could be cut or copied and pasted between applications with no change in appearance.

However, one would need to come up with sufficient constraints on the use of these characters so as to prevent their being used for spoofing, and there could be a problem with breakage on older implementations. (Although in a way, breaking is better than sending people to the wrong place.)

What we recommend in the UBA is that if people are going to override the BIDI algorithm for any purpose, that they effectively do so by the insertion of bidi controls. So how would this play out with URLs?
  1. I type a URL into an address bar. Since the program is URL-aware*, it parses out the labels. Based on whatever standard mechanism is defined (e.g., the URL contains a RTL character), it is detected as a BIDI URL, and its labels are ordered consistently. Effectively, that is done by inserting RLM at the start of each label that doesn't begin with a RTL character and at the end of each label that doesn't end with a RTL character. (One could use the embedding codes, but they are more dangerous.)
  2. This is the display form: when the URL is looked up, the RLMs have to be stripped before the domain is transformed into punycode and the rest is %-escaped.
  3. If I cut or copy that URL, then the RLMs go with it into plain text on the clipboard.
  4. When I paste that address into plain text, it then appears in the same order as it was in the address bar.
Take another case:
  1. I see a URL in some plain text (whether or not it is consistently ordered), and cut and paste that plaintext URL into an address bar (or other URL-aware* program). In that case, the program renormalizes the URL. That is, it strips out all bidi controls, and then reapplies the BIDI detection and RLM insertion. I then end up with consistent ordering in the result.
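The two cases above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a proposed standard: the set of label delimiters, the list of bidi controls to strip, the "contains a RTL character" test, and the function names are all choices made for this sketch.

```python
import re
import unicodedata

RLM = "\u200f"

# Assumed set of bidi controls to strip: LRM, RLM, ALM, the embedding and
# override codes (U+202A..U+202E), and the isolates (U+2066..U+2069).
BIDI_CONTROLS = re.compile("[\u200e\u200f\u061c\u202a-\u202e\u2066-\u2069]")

# Assumed label delimiters; a real parser would follow the URL syntax.
DELIMS = re.compile(r"(://|[/.?#&=:@])")


def is_rtl(ch):
    return unicodedata.bidirectional(ch) in ("R", "AL")


def strip_controls(url):
    """Form used for lookup: no bidi controls at all."""
    return BIDI_CONTROLS.sub("", url)


def to_display_form(url):
    """Strip any existing controls, then wrap labels with RLM as needed."""
    url = strip_controls(url)
    if not any(is_rtl(ch) for ch in url):
        return url  # not detected as a BIDI URL, leave it alone
    parts = DELIMS.split(url)  # even indices: labels, odd: delimiters
    for i in range(0, len(parts), 2):
        label = parts[i]
        if not label:
            continue
        if not is_rtl(label[0]):
            label = RLM + label
        if not is_rtl(label[-1]):
            label = label + RLM
        parts[i] = label
    return "".join(parts)


heb = "\u05d0\u05d1\u05d2"                 # three Hebrew letters
typed = "http://" + heb + ".com"
shown = to_display_form(typed)             # what goes to the clipboard
pasted_back = to_display_form(shown)       # renormalized on paste
```

Because the display form is always rebuilt from the stripped URL, pasting an already-normalized URL, or one carrying stray controls from elsewhere, yields the same result as typing it fresh.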
Where the query string contains LTR or RTL characters, there are a couple of choices. For most people, the query part is just technical gorp. And websites are able to put whatever they want into those strings; their interpretation is private to that site. So there are a couple of approaches (at least):
  • Not really bother with it: if it contains LTR characters then it reorders in a funny way, but since it is technical gorp we don't care.
  • Have some simple standardized way of mapping LTR characters in the query part into bidi characters that sites can use if they want to be wholly RTL.
Note that in no case would we expect people to manually put in the RLMs.

Change Default Ordering. Define a deterministic way to determine the end of a URL in plain text. Modify the BIDI algorithm to recommend (a) parsing for URLs, and (b) reordering the "labels"* to have a uniform direction. That would apply then in plain text, email, html, and address bars.

The proponents of specialized reordering really need to come up with a good story for how to deal with the security and interoperability issues presented by plaintext applications and non-new-URL-ordering applications.

There are actually two variants of this: 
a. have the consistent order be LTR.
b. have the consistent order be the paragraph direction.