Sunday, February 08, 2009

D-elegating Constructors

The D programming language allows a constructor of a class to call another constructor of the same class in order to share initialization code. This feature is called "delegating constructors"; it is also present in C# and in the emerging C++0x.

C#'s syntax for delegating constructors resembles the initializer lists in C++, and strictly enforces that the delegated constructor is called before any other code in the body of the caller constructor; the feature is masterfully explained in More Effective C#: 50 Specific Ways to Improve Your C# (Effective Software Development Series).

D is more flexible: a constructor can be called from another constructor's body much like any other "regular" method, provided that some simple rules are observed (for example, it is not permitted to call a constructor from within a loop).

A D compiler must detect constructor delegation and ensure that some initialization code is not executed more than once. Let's consider an example:

class Example
{
    int foo = 42;
    int bar;

    this()
    {
        bar = 13;
    }

    this(int i)
    {
        foo = i;
        this();
    }
}

In the first constructor, before the field bar is assigned the value 13, some "invisible" code executes: first, the constructor of the base class is invoked. The Example class does not have an explicit base; but in D, similar to Java and C#, all classes have an implicit root Object base. It is as if we wrote:

class Example : Object
{ ...
}

After generating the call to Object's constructor, the compiler generates the code that initializes foo to 42. The explicit assignment written by the programmer executes afterwards.

The compiler must take care that the initialization steps described above happen only once in the second constructor. This is not simply a matter of efficiency; more importantly, it is a matter of correctness. If the call to the base Object constructor and the initialization of foo were generated blindly inside the body of each constructor, then the following would happen in the second constructor's case:

  1. Object's ctor is invoked (compiler generated)

  2. foo = 42 (compiler generated)

  3. foo = i (programmer's code)

  4. constructor delegation occurs (programmer's code), which means that:

  5. Object's ctor is invoked again (compiler generated)

  6. foo = 42 again (compiler generated), overwriting the value i

  7. bar = 13 (programmer's code)


This is obviously incorrect, since it leaves the Example object in a different state than the programmer intended: foo ends up as 42 instead of the value of i passed to the constructor.
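The double-initialization sequence can be emulated in plain D. This is only a sketch: NaiveExample and its methods prologue, ctorDefault and ctorInt are hypothetical stand-ins for the code a naive back-end would emit, not actual compiler output. Ordinary methods are used in place of real constructors so that the normally invisible steps can be written out.

```d
// Emulates a back-end that blindly emits the hidden prologue
// (base constructor call + field initializers) into every constructor.
// All names are hypothetical; ordinary methods stand in for constructors.
class NaiveExample
{
    int foo;
    int bar;

    // Stand-in for the compiler-generated steps. Object's constructor
    // has no observable effect here, so only the initializer is modeled.
    void prologue()
    {
        foo = 42;
    }

    // Models this()
    void ctorDefault()
    {
        prologue();      // emitted blindly
        bar = 13;        // programmer's code
    }

    // Models this(int i)
    void ctorInt(int i)
    {
        prologue();      // emitted blindly
        foo = i;         // programmer's code
        ctorDefault();   // delegation re-runs the prologue
    }
}

unittest
{
    auto e = new NaiveExample;
    e.ctorInt(7);
    assert(e.foo == 42); // the programmer's value 7 was overwritten
    assert(e.bar == 13);
}
```

The unittest block can be run with, for example, dmd -unittest -main -run naive.d (the file name is arbitrary).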

Such a scenario is easily avoided by a native compiler, which translates object creation into several distinct steps:

  1. memory for the object is allocated

  2. invocation of base ctor is generated

  3. initializers are generated (this is where foo = 42 happens)

  4. constructor as written by programmer is invoked


The important thing to note is that a native compiler leaves the constructors alone, as written by the programmer, and inserts its magic "pre-initialization" steps between the memory allocation and the constructor invocation.
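The four steps can be mimicked in D to show why delegation is harmless under this scheme. Again this is a sketch under stated assumptions: NativeExample, preInitialize, bodyDefault and bodyInt are hypothetical names, and the constructor bodies are modeled as ordinary methods so that the hidden steps can live outside them.

```d
// Emulates the native scheme: the hidden steps run once, outside the
// constructor bodies. All names are hypothetical.
class NativeExample
{
    int foo;
    int bar;

    // Constructor bodies exactly as the programmer wrote them,
    // modeled as ordinary methods.
    void bodyDefault()
    {
        bar = 13;
    }

    void bodyInt(int i)
    {
        foo = i;
        bodyDefault();   // delegation runs only the programmer's code
    }
}

// Stand-in for steps 1 to 3 of object creation.
NativeExample preInitialize()
{
    auto obj = new NativeExample; // 1. allocate memory
    // 2. the base constructor call would go here (not modeled)
    obj.foo = 42;                 // 3. field initializers
    return obj;
}

unittest
{
    auto e = preInitialize();
    e.bodyInt(7);                 // 4. constructor as written
    assert(e.foo == 7);
    assert(e.bar == 13);
}
```

Because the initializer runs before either body is entered, delegating from one body to the other cannot repeat it.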

When writing a compiler back-end for .NET, things are slightly different: the creation of an object is expressed in one compact, single line of MSIL (Microsoft Intermediate Language) assembly code:

newobj <constructor call>

In our example, that would be

newobj instance void ctor.Example::.ctor()

and

newobj instance void ctor.Example::.ctor(int32)

respectively. So the compiler-generated magic steps of calling the base constructor and running the field initializers have to happen inside the constructor body. To prevent the erroneous double-initialization scenario, I had to generate a hidden Boolean "guard" field for classes that use constructor delegation. In the listing below, the field is set just before the delegating call, and the delegated constructor checks it before calling the base constructor and running the field initializers. Here's what the generated IL code looks like:

//--------------------------------------------------------------
// ctor.d compiled: Sun Feb 08 23:04:49 2009
//--------------------------------------------------------------
.assembly extern mscorlib {}
.assembly extern dnetlib {}
.assembly 'ctor' {}

.module 'ctor'


.class public auto ctor.Example extends [dnetlib]core.Object
{
.field public int32 foo
.field public int32 bar
.method public hidebysig instance void .ctor ()
{
.maxstack 3
ldarg.0
ldfld bool 'ctor.Example'::$in_ctor
brtrue L0_ctor
ldarg.0
call instance void [dnetlib]core.Object::.ctor()
ldarg.0
ldc.i4 42
stfld int32 'ctor.Example'::foo
L0_ctor:
ldarg.0 // 'this'
ldc.i4 13
stfld int32 'ctor.Example'::bar
ret
}
.method public hidebysig instance void .ctor (int32 'i')
{
.maxstack 3
ldarg.0
call instance void [dnetlib]core.Object::.ctor()
ldarg.0
ldc.i4 42
stfld int32 'ctor.Example'::foo
ldarg.0 // 'this'
ldarg.1 // 'i'
stfld int32 'ctor.Example'::foo
ldarg.0 // 'this'
ldc.i4 1
stfld bool 'ctor.Example'::$in_ctor
ldarg.0
call instance void ctor.Example::.ctor ()
ret
}
.field bool $in_ctor
} // end of ctor.Example
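For readers who do not read IL fluently, the guard logic in the listing can be paraphrased in D. This is a sketch with hypothetical names: inCtor stands for the hidden $in_ctor field, prologue for the base constructor call plus field initializers, and the two methods for the two .ctor bodies.

```d
// D paraphrase of the generated IL; all names are hypothetical.
class GuardedExample
{
    int foo;
    int bar;
    bool inCtor; // stand-in for the hidden $in_ctor field

    // Stand-in for the base constructor call and field initializers.
    void prologue()
    {
        foo = 42;
    }

    // Models .ctor()
    void ctorDefault()
    {
        if (!inCtor)     // like brtrue L0_ctor: skip when the flag is set
            prologue();
        bar = 13;
    }

    // Models .ctor(int32)
    void ctorInt(int i)
    {
        prologue();
        foo = i;
        inCtor = true;   // set just before delegating
        ctorDefault();
    }
}

unittest
{
    // Delegating path: the prologue runs once, so foo keeps the value 7.
    auto a = new GuardedExample;
    a.ctorInt(7);
    assert(a.foo == 7 && a.bar == 13);

    // Direct path: the flag is clear, so the prologue runs as usual.
    auto b = new GuardedExample;
    b.ctorDefault();
    assert(b.foo == 42 && b.bar == 13);
}
```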

As a side note, a small redundancy still exists in the second constructor: foo is assigned 42 only to be set to another value right away. I am hoping that the JIT engine detects this and optimizes it out, so that it isn't much of an issue. I'd be happy to hear any informed opinions.

1 comment:

Anonymous said...

As far as delegating constructors go, you could take your Boolean logic and extend it further using a state-machine approach. Simple code-flow analysis would give you an idea of the constructor chain. Based on this, you could assign each constructor an identifier; the same analysis tells you which constructors can invoke each one, so you could emit check code for the cases where certain fields aren't assigned.

The only issue with this is the optimization to avoid redundancy might be mitigated by jumping through the individual field initializations, since it would likely be reduced to a series of jump tables based upon the overall complexity of the chain(s).

There might also be cases where it's inappropriate to avoid such redundancies, such as when the initializer depends on incrementing a static value that represents the identifier of the element. Although assigning the field later would alter its value, the programmer might, for some unknowable reason, rely on the increment as some sort of instance counter. Finding that their assignment is omitted, in lieu of a parameterized constructor that assigns the value, might confuse them a little.