Friday, April 25, 2008

Object Materialization

From the moment we included Linq to SQL in the Data Access Layer of our prototype applications, as a nice side-effect we started using POCOs for most entities.

Linq to SQL (as any ORM) enables the use of POCOs. POCO entities have many appeals, most of them based on Persistence Ignorance.  We delegate the persistence responsibility to our Data Access Layer, where we have code to materialize (read from a data source) and persist (save to a data source) entities (as in-memory objects).

Object Materialization, when working with ADO.Net, means projecting Object Collections from DataReaders, populating the properties of a custom object with the fields of a data record.

Linq to SQL performs this task internally, but it's limited to Linq queries over MS SQL databases. Now that we have POCO entities, it would be nice to obtain them from different data sources, like *-SQL, ODBC, Excel spreadsheets, CSV, or anything implementing an IDataReader interface.

As Jose Marcenaro anticipated here, and here, we want to design a Data Access Layer were Linq to SQL and Enterprise Library DAAB can live in harmony, that means: transparently sharing the same entity model.

Our primary objective is a custom stored procedure invocation based on Enterprise Library, projecting the same class of entities living in Linq to SQL Dbml files. Enterprise Library (or pure ADO.Net) queries deliver DataReaders (or DataSets), so the piece that's missing here is Object Materialization.

Like an alchemist seeking for a process to turn lead into gold, I began my quest for a Linq compatible ADO.Net Object Materializing mechanism.

I'm gonna show 3 different attempts, and at the end of this post, you can download a simple benchmarking app with all the different mechanisms I tried on.

First Attempt, FieldInfo.SetValue()

Fist of all, I noticed that an object materializer is a necessary part of any Linq Provider, and found an "official sample" in the Matt Warren blog, in a series of posts about "Building an IQueryable Provider".

http://blogs.msdn.com/mattwar/archive/2007/07/31/linq-building-an-iqueryable-provider-part-ii.aspx (below the title "The Object Reader")

He shows a simple Object Reader (aka Materializer), described as:

"The job of the object reader is to turn the results of a SQL query into objects. I’m going to build a simple class that takes a DbDataReader and a type ‘T’ and I’ll make it implement IEnumerable<T>. There are no bells and whistles in this implementation. It will only work for writing into class fields via reflection. The names of the fields must match the names of the columns in the reader and the types must match whatever the DataReader thinks is the correct type."

Basically, what this Object Reader does is:

  • Use Reflection over the target object type, to obtain the collection of FieldInfos.
  • Map the names in the FieldInfos with the DataReader field names.
  • While iterating the DataReader, use the FieldInfo.SetValue() method to populate the new target object type instances.

When working with Reflection performance is the first we worry about. As he advices, this is not a real world implementation, the use of Reflection to set the field values resulted very expensive.

Just to make it more Linq-compatible, I modified this object reader to look at properties instead of fields, setting private fields when its specified in a ColumnAttribute, like this:

private int _OrderID; 

[Column(Storage="_OrderID", AutoSync=AutoSync.OnInsert, DbType="Int NOT NULL IDENTITY", IsPrimaryKey=true, IsDbGenerated=true)]
public int OrderID
{
...
}


This is what Linq to SQL does.


The performance remained almost unchanged, because the cost of the initial lookup of Property/FieldInfo is negligible compared to the (NumberOfFields * NumberOfRecords) SetValue invocations.


This option becomes extremely unperformant when reading more than 500 rows. Its intended for didactic purposes only.


Dynamic Translator Lambda Expression


My second attempt were the most fun and educational prototype I wrote about Linq. As the previous attempt showed, using Reflection to populate fields must be avoided, The most simple and performant way to do this job is:




    while (dataReader.Read()) {

Pet pet = Translate(dataReader);

yield return pet;

}

function Pet Translate(IDataRecord dr) {

return new Pet {

Id = (int)dataReader.GetValue(0),

Name = (string)dataReader.GetValue(1),

Birthdate = (DateTime)dataReader.GetValue(2)

}

}



But life isn't that easy, I don't know fields names and positions until runtime, even further I want generic code, independent of the entity type (e.g. Pet).


Note that the Translate function above contains only one Object Initializer (new in C# 3.0), it could be wrote as a Lambda Expression (new in C# 3.0 too)




    Func<IDataRecord,Pet> Translate = (IDataRecord dr => new Pet {

Id = (int)dataReader.GetValue(0),

Name = (string)dataReader.GetValue(1),

Birthdate = (DateTime)dataReader.GetValue(2)

}



Again, this code can't be hardcoded, how can we create this Lambda Expression dynamically at runtime? with Expression Trees (yes, new in C# 3.0 too!)


In C# 3.0 we can programmatically build Expression Trees and compile them later into function delegates. We're going to use the Reflection info to build the above Func<IDataRecord,*>. Once it's compiled is (almost) as fast as directly getting values from the DataReader as shown above.


The code looks a little scary because it uses (and abuses of) Linq to Objects over the PropertyInfo collection to build the Expression Tree: it's like "Linq to Linq". I found big Lambda Expressions a little difficult to indent (don't worry you can download the source files at the end :))




using System; 
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using System.Reflection;
using System.Linq.Expressions;
namespace ObjectMaterializer
{
public static class TranslatorBuilder
{

/// <summary>
/// Dynamically creates a Translator Lambda Expression to project T instances from the IDataRecord
/// </summary>
/// <typeparam name="T">The projected type</typeparam>
/// <param name="record">A source data record</param>
/// <returns></returns>
public static Expression<Func<IDataRecord, T>> CreateTranslator<T>(IDataRecord record)
{
// get properties info from the output type
Dictionary<string, PropertyInfo> propInfos = typeof(T).GetProperties().ToDictionary(pi => pi.Name);

// get field names in the DataRecord
var fieldMapping = new Dictionary<int, PropertyInfo>();
for (int i = 0; i < record.FieldCount; i++)
{
string name = record.GetName(i);
if (propInfos.ContainsKey(name))
fieldMapping[i] = propInfos[name];
}

// prepare method info to invoke GetValue and IsDBNull on the IDataRecord
MethodInfo rdrGetValue = typeof(IDataRecord).GetMethod("GetValue");
MethodInfo rdrIsDBNull = typeof(IDataRecord).GetMethod("IsDBNull");
// prepare reference to the IDataRecord rdr parameter
ParameterExpression rdrRef = Expression.Parameter(typeof(IDataRecord), "rdr");

/** builds the translator Lambda Expression
*
* assing each property to its matching field, e.g.:
* new T {
* PropertyName1 = (ProperCast1)rdr.GetValue(ordinal1),
* PropertyName2 = rdr.IdDbNull(ordinal2) ? (ProperCast2)null : (ProperCast2)rdr.GetValue(ordinal2),
* ...
* }
*
* Note that null values on non-nullable properties will throw an Exception on assignment
*
* **/
Expression<Func<IDataRecord, T>> proj = (Expression.Lambda<Func<IDataRecord, T>>(
Expression<T>.MemberInit(Expression<T>.New(typeof(T)),
fieldMapping
.Select(fm =>
Expression.Bind(fm.Value,

((!fm.Value.PropertyType.IsValueType) || (fm.Value.PropertyType.IsGenericType && fm.Value.PropertyType.GetGenericTypeDefinition() == typeof(Nullable<>))) ?
//accepts nulls, test IsDbNull
(Expression)Expression.Condition(Expression<bool>.Call(rdrRef, rdrIsDBNull, Expression<int>.Constant(fm.Key)),
Expression.Convert(Expression.Constant(null), fm.Value.PropertyType) // value is System.DbNull, assign null
,
Expression.Convert(
(fm.Value.PropertyType == typeof(System.Data.Linq.Binary)) ? // convert byte[] to System.Data.Linq.Binary

(Expression)Expression.New(typeof(System.Data.Linq.Binary).GetConstructor(new Type[] { typeof(byte[]) }),
Expression.Convert(Expression.Call(rdrRef, rdrGetValue, Expression<int>.Constant(fm.Key)), typeof(byte[])))

:
(Expression)Expression.Call(rdrRef, rdrGetValue, Expression<int>.Constant(fm.Key)) // value is not-null, assign
, fm.Value.PropertyType)
)

:
// doesn't accept nulls, direct assign
(Expression)Expression.Convert(Expression.Call(rdrRef, rdrGetValue, Expression<int>.Constant(fm.Key)), fm.Value.PropertyType)

) as MemberBinding
)
)
, rdrRef
));

return proj;
}

}
}


Note that timestamps, returned as byte[] by ADO.Net, are transformed into System.Linq.Binary by Linq to SQL, I added support for that.


Now we can use this TranslatorBuilder like this:




// generic method
public IEnumerable<T> ReadAllObjects<T>(IDataReader reader){

Expression<Func<IDataRecord,T>> translatorExpression = TranslatorBuilder.CreateTranslator<T>(reader);
Func<IDataRecord,T> translator = translatorExpression.Compile();

while (reader.Read())
{
T instance = translator(reader);
yield return instance;
}
}
public IEnumerable<Pet> ReadAllPets(IDataReader reader) {
return ReadAllObjects<Pet>(reader);
}


This is pretty elegant and performs great... but is not enough for us.


The Linq Object Materializer normally sets the private fields to avoid invoking the public property setters. This is not only to avoid a performance overhead, but because public property setters are often used for change tracking. With a Lambda Expression (or any C# expression) we can't access private fields.


Here I almost surrendered. How can I set private fields without using Reflection?


If Microsoft guys can, we can!, my boss (Jose Marcenaro) told me about an advanced and mysterious .Net feature, brought with the .Net 2.0 Framework: LCG, Lightweight Code Generation.


Googling around I found that LCG is what the Linq to SQL team used to build their Object Materializer, used on the DataContext.Translate() function.


Wait!, Why not just use the DataContext.Translate() function?, because:

  • It works only for Microsoft SQL Server databases
  • It requires an open db connection as a parameter
  • It requires .Net 3.5 (LCG is in .Net 2.0). Of course using an Object Materializer in .Net 2.0 if you don't have Linq to Objects may not sound so interesting.

Lightweight Code Generation


Since .Net 2.0 under the namespace System.Reflection.Emit are a couple classes that allow to programmatically generate dynamic methods from MSIL (Microsoft Intermediate Language) instructions. It's like adding at runtime a little piece of pre-compiled code.


Using this, we can build at runtime fast methods to set or get a field or property (even private ones). Here you may think "IL instructions??? I don't want to learn a low-level programming language for this!!". Relax, you only need 5 MSIL instructions.


Here's a helping class that generates a field set:




using System; 
using System.Collections.Generic;
using System.Text;
using System.Reflection;
using System.Reflection.Emit;

namespace ObjectMaterializer
{
public static class AccessorBuilder
{

public delegate void MemberSet<T>(T obj, object value);

public static MemberSet<T> CreateFieldSet<T>(FieldInfo fi)
{
Type type = typeof(T);

DynamicMethod dm = new DynamicMethod("Set" + fi.Name, null, new Type[] { type, typeof(object) }, type);

ILGenerator il = dm.GetILGenerator();
// load the target object instance (argument 0) in the stack
il.Emit(OpCodes.Ldarg_0);

// load the new value (argument 1) in the stack
il.Emit(OpCodes.Ldarg_1);

if (fi.FieldType.IsValueType)
// if field contains a value type, we need to unbox it
il.Emit(OpCodes.Unbox_Any, fi.FieldType);
else
// if field contains a non-value type, we need to cast it
il.Emit(OpCodes.Castclass, fi.FieldType);

// set fi object's field value from the stack
il.Emit(OpCodes.Stfld, fi);

// return the value on the top of the stack
il.Emit(OpCodes.Ret);

return (MemberSet<T>)dm.CreateDelegate(typeof(MemberSet<T>));

}

}

}

Lightweight Generated Code, and MSIL are advanced subjects, you can find a lot of samples googling around.


Using this field setters, I improved my first attempt, replacing FieldInfo.SetValue() with this dynamically generated methods, which once compiled into delegates perform as fast as conventional methods.


Later, I added a Field Setters Cache, to avoid building this dynamic methods again on every query.


Some benchmarking shows that this approach is (almost) as fast as the DataContext.Translate() function. There's a small performance difference yet, Why? If anyone can tell me, I'll be glad to update this post! :)


Anyway, our primary objective (in bold at the very beginning of this post) is achieved!!!, (it wasn't to beat the Linq to SQL Object Materializer performance).


The code


I put all these mechanisms in a simple benchmarking app that you can download here


  • First Attempt (using FieldInfo.SetValue): SimpleObjectReader.cs
  • Dynamic Translator Lambda Expression: TranslatorObjectReader.cs
  • LCG setters: LinqObjectReader.cs
  • re-using the DataContext.Translate() function: LTSObjectReader.cs

To run this you will need VisualStudio 2008, .Net 3.5 Framework and a Northwind db, which connection string you can set in the app.config.

2 comments :

Samuel said...

Félicitation pour cet article!

À mon avis, il devrait y avoir beaucoup plus d'articles pour ce sujet car à l'exception d'utiliser Linq To Sql, il n'y a pas d'autre façon de faire rapidement la conversion d'une application "Brownfield" vers de la programmation orientée-objet.

Je préfère encore aujourd'hui continuer à utiliser la formulation des requêtes à l'ancienne façon car je trouves que pour des requêtes complexes, il n'y a pas assez de support encore.

Un convertisseur de DataReader en objets est donc la solution pour moi.

Le gros problème qui restait était la performance mais avec plusieurs tests que j'ai fait de mon côté, la création d'objet via le ILGenerator est quasiment aussi rapide que de créer ses objets manuellement.

Juste pour connaissance de cause, voici le résultat d'un petit test que j'ai fait pour convertir 10 000 lignes d'un DataReader contenant 3 colonnes String en object avec le ILGenerator:

Reflection: 150 millisecondes
Manuellement: 8 millisecondes
ILGenerator: 16 millisecondes

Vive ILGenerator !!!

Benjamin Eidelman said...

@Samuel,désolé, je ne parle français :)

But, thanks a lot for yout feedback!, I'm glad to see your benchmark results! I did a benchmark myself when I wrote this article, but forgot to publish the results.

Viva el ILGenerator :), I saw a funny blog title some days ago, "Getting drunk on the power of Reflection.Emit" :D