Tuesday, June 23, 2009

The Power of Foreach

In D, arrays can be traversed using the foreach construct:

int [] a = [1, 5, 10, 42, 13];
foreach (i;a) {
writefln(“%d”, i);
}

The array of integers in this example is traversed and each element printed, in the natural order in which elements appear in the sequence. To visit the elements in reverse order, simply replace foreach with foreach_reverse. It is as intuitive as it gets.

Moreover, linear searches can be implemented with foreach: simply break out of the loop when the searched-for value is found:

foreach (i;a) {
writefln(“%d”, i);
if (i == 42) {
break;
}
}

Consider now a tree data structure where a tree node is defined as:

class TreeNode {
TreeNode left;
TreeNode right;
int value;
}

What is the meaning of a statement such as foreach (node; tree) { … }? The simple answer is that with the above definition of TreeNode, the code does not compile.

But if it were to compile, what should it do? Visit the nodes in order, or in a breadth-first fashion? No answer is the right one, unless we get to know more about the problem at hand. If we’re using a binary search tree to sort some data, then foreach would most likely visit the nodes in-order; if we’re evaluating a Polish-notation expression tree, we might want to consider post-order traversal.


Foreach Over Structs and Classes



Tree data structures and tree traversal occur often in computer science problems (and nauseatingly often in job interviews). Balanced binary trees are routinely used to implement associative containers, as C++ programmers are certainly familiar with the standard template collections set and map.

One difference between sequence containers (such as lists, arrays, and queues) on one hand, and containers implemented with trees on the other, is that there are more ways to iterate over the elements of a tree than there are ways to enumerate the elements of a sequence. A list (for example) can be traversed from begin to end and, in the case of double-linked lists, in reverse, from the end to the beginning; that’s it. But a tree can be traversed in order, in pre-order, post-order, or breadth first.

No built-in traversal algorithm will fit all possible application requirements. D’s approach is to provide the opApply operator as "standard plumbing" where users can plug their own algorithms for iterating over the elements of a class or struct. The operator is supposed to implement the iteration logic, and delegate the actual processing of the objects to a ... delegate:


class TreeNode {
public:
TreeNode left;
TreeNode right;
int value;
int opApply(int delegate(ref TreeNode) processNode) {
// ... tree traversal
// ...
return 0;
}
}

When the programmer writes a foreach loop, the compiler syntesizes a delegate function from the body of the loop, and passes it to opApply. In this example, the body of the delegate will have exactly one line that contains the writefln statement:

TreeNode tree = constructTree();
foreach(node; tree) {
writefln(“%d”, node.value);
}

For an in-order traversal, the implementation of opApply may look something like this:

int opApply(int delegate(ref TreeNode) processNode) {
return (left && left.opApply(processNode))
|| processNode(this)
|| (right && right.opApply(processNode));
}

The delegate that the compiler synthesizes out of the foreach body returns an integer (which is zero by default). A break statement in the foreach loop translates to the delegate function returning a non-zero value. As you can see in code above, a correct implementation of opApply should make sure that the iteration is "cancelled" when the delegate returns a non-zero value.

The traversal function’s argument must match the delegate argument type in the signature of opApply. In the example above the processNode function could modify the tree node that is passed in. If the TreeNode class writer wanted to outlaw such use, the opApply operator should have been declared to take a delegate that takes a const TreeNode parameter:

class TreeNode {
// …
int opApply(int delegate(ref const TreeNode) processNode) {
return processNode(this);
}
}

The new signature demands that the client code changes the parameter type from TreeNode to const TreeNode. Any attempt to modify the node object from within the user-supplied traversal function will fail to compile.

Another possible design is to encode all traversal algorithms as TreeNode methods. The following shows an example for the in-order algorithm (other traversal algorithms are left as an exercise for the reader):

class TreeNode {
// …
int traverseInOrder(int delegate(ref int) dg) {
if (left) {
int r = left.traverseInOrder(dg);
if (r) {
return r;
}
}
int r = dg(value);
if (r) {
return r;
}
if (right) {
r = right.traverseInOrder(dg);
if (r) {
return r;
}
}
return 0;
}
}

foreach(val; &tree.traverseInOrder) {
Console.WriteLine(val);
}


D Generators



A generator is a function or functor that returns a sequence, but instead of building an array or vector containing all the values and returning them all at once, a generator yields the values one at a time. Languages such as C# and Python have a yield keyword for this purpose. In D a generator can be implemented with foreach and a custom opApply operator. Assume one wants to print the prime numbers up to N, like this:

foreach (i; PrimeNumbers()) {
if (i > N) {
break;
}
writeln(i);
}

To make this work, the PrimeNumbers struct could be implemented like this:

struct PrimeNumbers {
int n = 1;
int primes[];

int opApply(int delegate(ref int) dg) {
loop:
while (true) {
++n;
foreach (p; primes) {
if (n % p == 0) {
continue loop;
}
}
primes ~= n;
if (dg(n)) {
break;
}
}
return 1;
}
}

3 comments:

BCSd said...

That is a nice overview of D's foreach system.

However your code examples have one major flaw (I'm sure this is true in D1 but I think it is also true in D2):

The value returned by the delegate in opApply is not just a zero/non-zero value. The actual value is important. If the delegate returns a non-zero value, opApply MUST return that value or bazaar things start happening.

The easy way to implement this is:

if(auto ret = dg(val)) return ret;

The under the hood reason for this is that if the body of the delegate has more than one exit path (say a goto, a labeled break, a return, etc.) the foreach code detects which of these to do by examining the return value of opApply.

Cristache said...

BCSd, this is a great observation.

I compiled and tested the code snippets before posting, and they do work for the specific examples (where I break out of the foreach loops), but you are right: the correct implementation is for opApply to return the result of the delegate call.

And this just gives me some ideas for interesting test cases for my compiler back-end.

Thanks!

Witek Baryluk said...

You may be interested in this: http://github.com/baryluk/iterators_2


Here is sample:
http://github.com/baryluk/iterators_2/blob/master/corod_tree.d