Tuesday, June 23, 2009

The Power of Foreach

In D, arrays can be traversed using the foreach construct:

int [] a = [1, 5, 10, 42, 13];
foreach (i;a) {
writefln(“%d”, i);
}

The array of integers in this example is traversed and each element printed, in the natural order in which elements appear in the sequence. To visit the elements in reverse order, simply replace foreach with foreach_reverse. It is as intuitive as it gets.

Moreover, linear searches can be implemented with foreach: simply break out of the loop when the searched-for value is found:

foreach (i;a) {
writefln(“%d”, i);
if (i == 42) {
break;
}
}

Consider now a tree data structure where a tree node is defined as:

class TreeNode {
TreeNode left;
TreeNode right;
int value;
}

What is the meaning of a statement such as foreach (node; tree) { … }? The simple answer is that with the above definition of TreeNode, the code does not compile.

But if it were to compile, what should it do? Visit the nodes in order, or in a breadth-first fashion? No answer is the right one, unless we get to know more about the problem at hand. If we’re using a binary search tree to sort some data, then foreach would most likely visit the nodes in-order; if we’re evaluating a Polish-notation expression tree, we might want to consider post-order traversal.


Foreach Over Structs and Classes



Tree data structures and tree traversal occur often in computer science problems (and nauseatingly often in job interviews). Balanced binary trees are routinely used to implement associative containers, as C++ programmers are certainly familiar with the standard template collections set and map.

One difference between sequence containers (such as lists, arrays, and queues) on one hand, and containers implemented with trees on the other, is that there are more ways to iterate over the elements of a tree than there are ways to enumerate the elements of a sequence. A list (for example) can be traversed from begin to end and, in the case of double-linked lists, in reverse, from the end to the beginning; that’s it. But a tree can be traversed in order, in pre-order, post-order, or breadth first.

No built-in traversal algorithm will fit all possible application requirements. D’s approach is to provide the opApply operator as "standard plumbing" where users can plug their own algorithms for iterating over the elements of a class or struct. The operator is supposed to implement the iteration logic, and delegate the actual processing of the objects to a ... delegate:


class TreeNode {
public:
TreeNode left;
TreeNode right;
int value;
int opApply(int delegate(ref TreeNode) processNode) {
// ... tree traversal
// ...
return 0;
}
}

When the programmer writes a foreach loop, the compiler syntesizes a delegate function from the body of the loop, and passes it to opApply. In this example, the body of the delegate will have exactly one line that contains the writefln statement:

TreeNode tree = constructTree();
foreach(node; tree) {
writefln(“%d”, node.value);
}

For an in-order traversal, the implementation of opApply may look something like this:

int opApply(int delegate(ref TreeNode) processNode) {
return (left && left.opApply(processNode))
|| processNode(this)
|| (right && right.opApply(processNode));
}

The delegate that the compiler synthesizes out of the foreach body returns an integer (which is zero by default). A break statement in the foreach loop translates to the delegate function returning a non-zero value. As you can see in code above, a correct implementation of opApply should make sure that the iteration is "cancelled" when the delegate returns a non-zero value.

The traversal function’s argument must match the delegate argument type in the signature of opApply. In the example above the processNode function could modify the tree node that is passed in. If the TreeNode class writer wanted to outlaw such use, the opApply operator should have been declared to take a delegate that takes a const TreeNode parameter:

class TreeNode {
// …
int opApply(int delegate(ref const TreeNode) processNode) {
return processNode(this);
}
}

The new signature demands that the client code changes the parameter type from TreeNode to const TreeNode. Any attempt to modify the node object from within the user-supplied traversal function will fail to compile.

Another possible design is to encode all traversal algorithms as TreeNode methods. The following shows an example for the in-order algorithm (other traversal algorithms are left as an exercise for the reader):

class TreeNode {
// …
int traverseInOrder(int delegate(ref int) dg) {
if (left) {
int r = left.traverseInOrder(dg);
if (r) {
return r;
}
}
int r = dg(value);
if (r) {
return r;
}
if (right) {
r = right.traverseInOrder(dg);
if (r) {
return r;
}
}
return 0;
}
}

foreach(val; &tree.traverseInOrder) {
Console.WriteLine(val);
}


D Generators



A generator is a function or functor that returns a sequence, but instead of building an array or vector containing all the values and returning them all at once, a generator yields the values one at a time. Languages such as C# and Python have a yield keyword for this purpose. In D a generator can be implemented with foreach and a custom opApply operator. Assume one wants to print the prime numbers up to N, like this:

foreach (i; PrimeNumbers()) {
if (i > N) {
break;
}
writeln(i);
}

To make this work, the PrimeNumbers struct could be implemented like this:

struct PrimeNumbers {
int n = 1;
int primes[];

int opApply(int delegate(ref int) dg) {
loop:
while (true) {
++n;
foreach (p; primes) {
if (n % p == 0) {
continue loop;
}
}
primes ~= n;
if (dg(n)) {
break;
}
}
return 1;
}
}

Thursday, June 18, 2009

Pragma Assembly

Publishing the source code for D.NET on CodePlex in its current (rough) form turned out to be a great idea, as I received very good feedback. Tim Matthews of New Zealand has submitted several bug reports and patches and convinced me to change my stance on the "assembly class hack".

The hack was a very limited solution to the problem of qualifying imported declarations by the name of the assembly where they live. I described the problem and the attempt to hack around it at the end of a post back in December; a diligent reader commented that I should have used the pragma mechanism instead. I resisted the suggestion at the time, mainly because I am trying to avoid changing the compiler front-end if I can help it. (Front-end changes have to ultimately be merged back into Walter' s source tree, and he is a very very busy guy.)

A D implementation on .NET is not going to be of much use without the ability to generate code that imports and interfaces with existing assemblies. Guided by this idea, Tim Matthews did a lot of trail-blazing and prototyping and showed that because in .NET namespaces can span across assemblies, there has to be a way of specifying an arbitrary number of assemblies in one import file. My "assembly class" hack allowed for one and only one assembly name to be specified.

So I had to byte the bullet and do the right thing: assembly names are now specified with a pragma, like this:

pragma (assembly, "mscorlib")
{
// imported declarations here...
}

Any number of such blocks can appear within a D import file. And in the future, the code will be extended to allow a version and a public key to be specified after the assembly name.

Another thing that has to be fixed is to make the compiler adhere to the convention of using a slash between enclosing classes and nested types, as described in Serge Lidin's Expert .NET 2.0 IL Assembler, CHAPTER 7 page 139:
Since the nested classes are identified by their full name and their encloser (which is in turn identified by its scope and full name), the nested classes are referenced in ILAsm as a concatenation of the encloser reference, nesting symbol / (forward slash), and full name of the nested class


Tim also submitted a front-end patch that allows directories and import files of the same name to coexist, so that the code below compiles without errors:

import System;
import System.Windows.Forms;


An alternative solution that I proposed was to create import files named "this.d", so that the above would have read:

import System.this;
import System.Windows.Forms;


After some consideration, I bought Tim's point that this is not what .NET programmers would expect, and went on to apply his patch.