Wednesday, March 11, 2009

Strumming Strings In the D Chord

In my recent interview with the InfoQ technology magazine I was asked about compatibility issues between D and .NET. I replied with a brief description of how array slices in D raise a conceptual incompatibility: System.Array and System.ArraySegment are distinct, unrelated types. In D arrays slices are indistinguishable from arrays and this creates the problems that I mentioned in the interview.

But there are other incompatibilities between D and .NET that I did not mention because I wanted to keep the interview lean and focused.

Take for example the built-in support for strings.

The keyword string is shorthand in both IL and C# for System.String (essentially a sequence of Unicode characters in the range U+0000 to U+FFFF).

In the D programming language, string is shorthand for invariant char[] and characters are unsigned bytes.

Side notes: D uses UTF8 to support foreign languages, and also sports the types wstring (a sequence of 2 byte-wide characters, compatible with Microsoft's Unicode) and dstring (for UTF-32 strings). Wide and double-wide (UTF-32) string literals are denoted by the "w" and "d" respective suffixes, as in: "Hello"w, "Good Bye"d (UTF-32 dchar works in the native D compiler, but is not currently supported in my .NET compiler).

When a variable of type string is encountered in a D source, the compiler emits a corresponding IL variable with the type unsigned int8[].

In IL there is a special instruction, ldstr, for loading string literals on the evaluation stack. This code
ldstr "Hello"
loads a "Hello" [mscorlib]System.String. If this literal is to be stored into a variable (say "x"), then my compiler will insert conversion code that looks somewhat like this:

call class [mscorlib]System.Text.Encoding
ldstr "Hello"
callvirt instance uint8[]

stloc 'x' // store byte array into variable x

For the cases where a D string (array of bytes) has to be converted to a System.String I provide an explicit string property, called sys, with the following D prototype:

static public String sys(string x);

The D programmer would write something like this:

import System;
// ... snip ...
string x = "Hello";
Console.WriteLine("Hello .NET".sys);

The compiler figures out that in the case of
Console.WriteLine("Hello .NET".sys);
the call to the sys function can be elided, and generates straightforwardly:
ldstr "Hello .NET"
call void [mscorlib]System.Console::'WriteLine' (string)

Matters get more interesting when we consider associative arrays. D offers a great convenience to programmers by supporting associative arrays directly in the language, for example

int [string] dict;

dict["one"] = 1;
dict["two"] = 2;
// ...

introduces an array of integers indexed by strings. By contrast, in other languages such data structures are implemented "externally" in a library; in C++ for example, std::map<std::string, int> is implemented in the STL; the C# equivalent of an associative array is the System.Collections.Generic.Dictionary. My friend and colleague Ionut Gabriel Burete contributed an implementation of associative arrays in the D compiler for .NET using exactly that class.

An associative array / dictionary with string keys is an interesting case, because System.String::Equals does the right thing out of the box, namely performs a lexicographical comparison of two strings; System.Array::Equals however simply compares object references, it does not iterate over and compare elements. This means that a Dictionary(string, int) will behave as you expect, but if the key is of the unsigned int8[] type you may be in for a surprise.

For this reason, I put a hack in the compiler back-end: for the case of associative arrays I break away from representing D strings as byte arrays in .NET, and use System.String instead, which works great (or so I thought up until I ran into the problem of generating the code for a foreach iteration loop).

For D code such as:

import System;
int [string] x;
x["one"] = 1;
x["two"] = 2;
// ...
foreach (k, v; x) {
Console.WriteLine("{0}, {1}", k, v);

the compiler synthesizes a function that corresponds to the body of the loop, and binds the key and value ("k" and "v" in the code snippet) to local variables in that function (this happens all inside the front-end).

The back-end must be able to detect the variables bound to foreach arguments and reconcile the data types where necessary. In the example above the type of the "k" variable in the IL will thus be System.String, and not unsigned int8[].


Anonymous said...

I think, you should use native System.String more aggressively and convert strings to utf8 when and only when it's really needed, e.g. on casting to other pointer types, pinvoke etc.

The Free Meme said...

Hi Anonymous, I tried using System.String by default like you are suggesting but I found it very difficult to get string slices to work right.

Anonymous said...

Do you mean, programmer could carefully count non-english utf8 char lengths for slices? I doubt someone did this. Isn't it better to change specs a little than coping with full compatibility? Do you plan full compatibility? I think, some features should change, since .net way is quite different from D way. Well... if you hope will outperform other .net dialects introducins D way into .net... hmm... it will be interesting.

The Free Meme said...

Anonymous, here's the crux of the issue: in D, a string is an array; arrays and slices can also be used interchangeably.

In .net System.String, System.Array and the generic System.ArraySegment are three different types. So, if I want to take a slice of a string in D/.NET (assuming strings implemented as System.String) I would need two conversions: from string to array of System.Char, and from that to ArraySegment.

Consider this (contrived, I agree) example:
void foo(string a)
cast(char)(a [0]) = 'W';
void main()
string s = "hello world";
assert(s[6] == 'W');
The generated code to make this work is quite complicated and very inefficient (again, assuming string === System.String). With strings implemented as unsigned int8[] arrays generating the code is quite straight forward.

I think that I found a good compromise so far... We'll see how well this works "in the field" once I release it :)