Tuesday, April 28, 2009

Slice and D-ice

It is amazing how much insight one can get into a language by simply writing a compiler for it... Today I am going to spend half a lunch break copying and pasting into this post a stack of notes related to array slices (collected over the past few months of working on a .net back-end for the D compiler).

D sports a data type called array slice that is intended to boost the performance of array-intensive computations. A slice is a lightweight value type, conceptually equivalent to a range of array elements, or a "view" into an array. It can be thought of as consisting of a reference to an array, and a pair of begin-end indexes into the array:

struct (T) ArraySlice {
T[] a; // reference to array
int begin; // index where slice begins
int end; // one past index where slice ends
}

The actual representation of a slice is currently internal to the compiler, and completely opaque to the programmer. The template struct above is not what the layout of a slice actually looks like, it is intended for illustrative purposes only.

Consider this code:

int a[] = [1, 3, 5, 7, 9, 11, 13, 17];
int[] s = a[2..5];

The second declaration introduces "s" as a slice of array "a", starting at position two and ending at (but not including) the fifth element. Using the template pseudo-definition of ArraySlice, the code is conceptually equivalent to:

int a[] = [1, 3, 5, 7, 9, 11, 13, 17];
ArraySlice!(int) s = { a, 2, 5 };

To understand how array slices may help performance, consider an application that reads and parses XML files. The input can be loaded as an array of characters (a huge string). The application builds a DOM representation of the input, and each node in the DOM contains a token string. This approach is wasteful, because the token strings hold copies of character sequences that are already present in the input; copying tokens around has a linear complexity (it is directly proportional with the number of characters in the token) and the same is true for the spatial complexity (how much memory is being used). But XML tokens could be modeled as slices of character arrays ("views" into the original XML input string), and complexity in both time and space would drop down to a constant value.

This design can be implemented in other languages than D, but memory management issues may add unnecessary complexity. In C++ for example we'll have to make sure that the life-time of the token slices does not exceed the scope of the original input string. D belongs to the family of garbage-collected languages; by default, objects are reference-counted, and holding a reference to an array slice indirectly keeps the original array "alive", because the slice contains a reference to it.

Now that the design rationale behind array slices is understood, let's take another look at the syntax. You have probably noticed that in the statement:
int[] s = a[2..5];

The declaration part introduces "s" as an array of integers; it is not until you see the assignment that the lightweight, array slice, true nature of "s" is revealed.
D has no special syntax for disambiguating between "true" arrays and array slices in declarations; they can be used interchangeably. As a matter of fact, a function signature with array parameters will happily accept a slice as argument. In the following code, both "a" and "s" are legal arguments for the count function:

int count(int[] arr) {
return arr.length; // return the number of elements in arr
}
writefln("%d", count(a)); // ok
writefln("%d", count(s)); // also ok


Resizing Slices


Both arrays and slices support the built-in length property. As you expect, an expression such as a.length tells you how many elements are present in the array; the property applies to array slices as well, and in that case it gives the number of elements within the slice range. For example, the output of the following code is "3":

int a[] = [1, 2, 3, 5, 7, 9, 11, 13, 17];
int[] s = a[2..5];
writefln("%d", s.length); // prints 3

So far so good, but I forgot to mention that the length property is read-write: not only can you query the size of an array, you can also change it, like this:

a.length = 100; // extend the array to 100 elements

Assignment to the length property resizes the array. This begs the question: what happens when an array slice is being resized? The answer of course is "it depends".
With "a" and "s" defined as above, let's say we resize the "s" slice from 3 elements to 7:

s.length = 7;

This extends the view of "s" into "a" up to the ninth element of "a" ("s" starts at 2). It is as we have said:

s = a[2..9];

The slice is still within the bounds of the "a" array. Resizing it is a constant-time operation that changes one field inside the internal representation of "s". If instead of the built-in slice we had used the template ArraySlice struct introduced above, resizing the slice would have amounted to:

int a[] = [1, 3, 5, 7, 9, 11, 13, 17];
ArraySlice!(int) s = { a, 2, 5 };
s.end = 9; // resize the slice

Because "s" is simply a "view" of the array, modifying an element in the array is immediately reflected into the slice, for example:

writefln("%d %d", a[2], s[0]); // prints "5 5"
a[2] = 23;
writefln("%d %d", a[2], s[0]); // prints "23 23"

What happens if we re-size "s" past the end of "a"?

s.length = 100; // what does this do?

The answer is that the behavior is up to the compiler. The current native compiler from Digital Mars changes the type of "s" from a lightweight view into an array to a full-fledged array, and re-sizes it to fit 100 elements.

int a[] = [1, 2, 3, 5, 7, 9, 11, 13, 17];
int[] s = a[2..5];
writefln("%d %d", a[2], s[0]); // prints "5 5"
s.length = 100;
a[2] = 23;
writefln("%d %d", a[2], s[0]); // prints "23 5"

In other words, resizing a slice past the end of the original array breaks up the relationship between the two, and from that point they go their separate merry ways.
This behavior also underlines the schizophrenic nature of array slices that are not full copies of arrays, unless they change their mind.

Concatenating Slices


We saw how array slices may be re-sized via the “length” property. Slices may be re-sized implicitly via concatenation, as in the following example:

int a[] = [1, 2, 3, 5, 7, 9, 11, 13, 17];
int s[] = a[0..5];
s ~= [23, 29];

The tilde is the concatenation operator for arrays, strings and slices. In this example, the slice is resized to accommodate the elements 23 and 29. Note that even that in this situation the resizing is different from had we written:
s.length += 2;

Extending the length by two elements simply bumps up the upper limit of the slice (because there is still room in the original array, "a"). As we saw in the previous section, if the new length exceeds the bounds of the original array, the slice will be "divorced" from the original array, and promoted from a light-weight view to a full, first-class array. If we just extend the length by two elements, the bounds of "a" are not exceeded.

However, in the case of appending (as in s ~= [23, 29]) in addition to resizing we are also setting the values of two additional elements. The slice needs to be divorced from the array, so the a[5] and a[6] are not overwritten with the values 23 and 29. The compiler turns "s" into a full array of length + 2 == 7 elements, copies the elements of “a” from 0 to 5, then appends values 23 and 29.

The problem, as with resizing past the bounds of the original array, is that after the array and the slice part their ways, it is no longer possible to modify value types in the original array via the slice (which has now been promoted to a standalone array). This is a run time behavior, hard to predict by statically analyzing (or visually inspecting) the code.

Rolling Your Own Array Slices



It is impossible to determine statically whether a D function parameter is an array or a slice by examining the function's code alone. It is up to the function's caller to pass in an array or a slice.

void f (int[] a) {
// ...
}
int a[] = [1, 2, 3, 5, 7, 9, 11, 13, 17];
f(a); // call f with an array argument
f(a[2..5]); // call f with a slice argument;

In some cases you may want to better communicate that your function is intended to work with slices rather than arrays. You may also want to have better control over the slice's properties. Say for example you want to make sure that a slice is never re-sized.

You can accomplish these things by rolling your own ArraySlice. The template struct that was introduced earlier is a good starting point. The signature of "f" can be changed to:

void f (ArraySlice!(int) a) { // ...

That's a good start, but the struct is not compatible with a built-in array slice. The following code does not compile:

struct (T) ArraySlice {
T[] a; // reference to array
int begin; // index where slice begins
int end; // one past index where slice ends
}

int a[] = [1, 2, 3, 5, 7, 9, 11, 13, 17];
ArraySlice!(int) s = { a, 2, 5 };
foreach(i; s) { // error, does not compile
writefln("%d", i);
}

You could of course rewrite the foreach loop to use the begin..end range:

foreach(i; s.begin..s.end) {
writefln("%d", s.a[i]);
}

In addition to being more verbose such code is not very well encapsulated, since it explicitly accesses the struct's public members. If we later decide to factor out ArraySlice into its own module, and make the “a”, “begin”, and “end” members private, the code above will not compile anymore.

All that's preventing the compact version of the foreach loop from compiling is that the opApply operator is missing. So let's add one:

struct (T) ArraySlice {
// ...
int opApply(int delegate(ref int) dg) {
foreach (i;begin..end) {
dg(a[i]);
}
return 0;
}
}

Great! This gets us past the compilation error. The foreach loop now compiles and prints out all the elements in the slice. There is a small and subtle bug in this code though. Suppose that instead of printing all elements in the slice, you're doing a linear search, for example:

foreach(i; s) {
if (i == 5) { // found it!
break;
}
}

Astonishingly enough, this code will not break out of the foreach loop, instead it will continue through all the elements in the slice.

The http://www.digitalmars.com/d/2.0/statement.html website prescribes the behavior of the opApply operator:
"The body of the apply function iterates over the elements it aggregates, passing them each to the dg function. If the dg returns 0, then apply goes on to the next element. If the dg returns a nonzero value, apply must cease iterating and return that value".
The D compiler synthesizes the delegate function from the body of the foreach loop, and the code above is transformed internally to this equivalent form:

int dg(ref int i) {
if (i == 5) {
return 1;
}
return 0;
}
s.opApply(&dg);

The bug in the opApply operator is that the loop should be broken out of when dg returns non-zero:

int opApply(int delegate(ref int) dg) {
foreach (i; begin..end) {
if (dg(a[i])) break;
}
return 0;
}

Now foreach works correctly. To support foreach_reverse, just add a opApplyReverse member function to the ArraySlice template struct:

int opApplyReverse(int delegate(ref int) dg) {
foreach_reverse (i; begin..end) {
if (dg(a[i])) break;
}
return 0;
}

What about manipulating the length of the slice? Neither of the lines below compiles:

writefln("%d", s.length);
s.length = 100;

To support the length property, we have to add these methods:

struct (T) ArraySlice {
// ...
// return the length of the slice
int length() { return end - begin; }

// resize the slice, preventing it to grow past
// the original array's length
void length(int newLength) {
end = begin + newlength;
if (end > a.length) { end = a.length; }
}
}

To prevent resizing the slice, all there is to do is to leave the second overload undefined, effectively turning length into a read-only property.

Clients of the struct can still set its individual members to inconsistent values. To disallow incorrect usage, the "a", "begin", and "end" members can be made private (and the struct will have to move to its own module, because in D private access is not enforced if the class or struct lives in the same module as the client code).

To make the ArraySlice struct even more source-compatible with built-in slices, you can give it an opIndex operator:

struct (T) ArraySlice {
// ...
T opIndex(size_t i) { return a[i]; }
}

The opIndex operator allows this to work:
writefln("%d", s[3]);

Assignment to array elements is not allowed:

s[3] = 42; // error, does not compile

If you want the array elements to be modified via the slice like that, just define
opIndexAssign:

struct (T) ArraySlice {
// ...
void opIndexAssign(size_t i, T val) { a[i] = val; }
}

When you put it all together, the ArraySlice struct will look something like this:

struct (T) ArraySlice {
private:
T[] a; // reference to array
int begin; // index where slice begins
int end; // one past index where slice ends
public:
T opIndex(size_t i) { return a[i]; }
void opIndexAssign(size_t i, T val) { a[i] = val; }
int length() { return end - begin; }
// comment this function out to prevent resizing
void length(int newLength) {
end = begin + newlength;
if (end > a.length) { end = a.length; }
}

// support foreach
int opApply(int delegate(ref int) dg) {
foreach (i;begin..end) {
if (dg(a[i])) break;
}
return 0;
}
}

In conclusion, by rolling your own array slice implementation the intent of your code becomes clearer, your level of control over it increases, and you can still retain the brevity of the built-in slices.

1 comment:

Witek Baryluk said...

You have error in opIndexAssign, as I remember correclty, arguments are written in another order:

void opIndexAssign(T val, uint index) { ... }

it is also good to change void to T, and return val.

so chaining will work:
a[2] = a[4] = a[6] = 42;