
Using Reflection to Serialize F# data types to/from XML

[See attachment "02-SerializeXml.zip" above, for the F# code discussed in this thread.]

(Having learned to combine Object-Oriented style with Functional style to good effect in "Syntax tree walking in a PEG metacompiler", [link:cs.hubfs.net] , I detour to a different topic, use of Reflection and .NET XML classes.)

Normal .NET data can be easily saved to XML via the XmlSerializer class in System.Xml.Serialization: [link:msdn2.microsoft.com]

However, F# data can't (currently) be serialized this way, because .NET serialization (including XmlSerializer) relies on classes having parameterless constructors, and then setting the fields after construction. Clearly, the designers were used to languages with *mutable* data!
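To make the constructor problem concrete, here is a minimal sketch that asks Reflection whether a type has a public parameterless constructor. The helper name "hasPublicDefaultCtor" is made up for this illustration, and the sketch assumes a current F# compiler, where union constructors are generated as non-public; the 2008 compiler may have differed in detail.

```fsharp
open System

type NodeIndex = int

type Wires =
    | Complete
    | Input of NodeIndex
    | Output of NodeIndex
    | Thru of NodeIndex * NodeIndex

/// Hypothetical helper: true when the type exposes a public parameterless
/// constructor, which XmlSerializer needs in order to create an instance.
let hasPublicDefaultCtor (t: Type) =
    not (isNull (t.GetConstructor(Type.EmptyTypes)))

// A discriminated union versus an ordinary mutable .NET class.
let wiresOk = hasPublicDefaultCtor typeof<Wires>
let builderOk = hasPublicDefaultCtor typeof<System.Text.StringBuilder>
printfn "Wires: %b, StringBuilder: %b" wiresOk builderOk
```

Under those assumptions the union should report false and the ordinary class true, which is the gap the rest of this thread works around.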

A serializer for F# data to XML and back would be useful to have; and even if there may be some built-in solution eventually, implementing this still makes a nice demonstration of the use of Reflection, as well as how to build Xml documents, or pick through existing Xml documents.

This series of posts demonstrates a partial solution, that could be extended to cover all F# types. Specifically, I have chosen "discriminated unions", because that is (IMHO) the most distinctive F# data type, it is somewhat complex to map onto .NET (as can be seen by examining a disassembly in Reflector), and in combination with pattern-matching is a potent tool to use. Besides, namin posted a challenge, "On Extending the Scheme [sic] Compilation by Reflection on Types of Expert F#", [link:cs.hubfs.net] and this is the key feature needed to address that!

The Goal:

Use XML as a data store for a FlowBox, as defined below. But do not write any code specific to FlowBoxes; just Reflect on the type passed in. The type declarations are simplified from the Micado original, [link:code.google.com] in order to keep the test data and explanation simple:

type NodeIndex = int

type Wires =
    Complete 
  | Input of NodeIndex    // Input Terminal with wire coming out.
  | Output of NodeIndex   // Output terminal with wire coming in.
  | Thru of NodeIndex * NodeIndex

type FlowBox =
    Prim of Wires  // Primitive
  | Extd of Wires * FlowBox // Extended
  | And of Wires * FlowBox array

First, we'll create some test data.

// ---------- Tests ----------

let ps s = s |> printfn "%A"
let psm s msg = printfn "%A = %s" s msg

let C = Complete
let I w = Input w
let O w = Output w
let T w1 w2 = Thru (w1, w2)
let P w = Prim w
let E w fb = Extd (w, fb)
let A w fba = And (w, fba)
    
let w1 = I 1
let w2 = T 1 2
let w3 = O 2
let fb1 = P w1
let fb2 = E w2 fb1
// Not a very meaningful schematic...
let fb3 = A w3 [| fb1; fb2 |]
psm w1 "w1"
psm w2 "w2"
psm w3 "w3"
psm fb1 "fb1"
psm fb2 "fb2"
psm fb3 "fb3"

== printed output ==>

Input 1 = w1
Thru (1,2) = w2
Output 2 = w3
Prim Input 1 = fb1
Extd (Thru (1,2),Prim Input 1) = fb2
And (Output 2,[|Prim Input 1; Extd (Thru (1,2),Prim Input 1)|]) = fb3

Consider what data we might store if creating an XML document by hand (for code that does that, see my posts in namin's challenge thread, mentioned earlier -- that code is extracted from Micado, simplified to focus on this topic). Note the structure printed for "fb3" in the last line of output shown immediately above. Our by-hand solution stores that as the following XML:

<And input="2">
  
  <Extended input="1" output="2">
    
  </Extended>
</And>

NOTE: If you are unfamiliar with the schematic terminology being used here, it might look like there is an error in this output: fb3 starts with "And (Output 2", and the output starts with 'And input="2"'. The explanation is that the first "Output" refers to an "Output terminal", which is a node having a wire coming in (an input), and no wire going out. I was translating code that stored the data in terms of that latter sense. Sorry if now you are even more confused! In my automated version I don't translate it like this, but I didn't take the time to rewrite the manual one to match.

By ToolmakerSteve on 4/1/2008 7:29 PM (permalink)

Here is a sneak preview of the data we will be outputting from the generalized logic:

// F# structure
And (Output 2,[|Prim Input 1; Extd (Thru (1,2),Prim Input 1)|])

== XML output =>

<FlowBox.And>
  <Wires.Output v1="2" />
  <System.Array elemType="FlowBox">
    <FlowBox.Prim>
      <Wires.Input v1="1" />
    </FlowBox.Prim>
    <FlowBox.Extd>
      <Wires.Thru v1="1" v2="2" />
      <FlowBox.Prim>
        <Wires.Input v1="1" />
      </FlowBox.Prim>
    </FlowBox.Extd>
  </System.Array>
</FlowBox.And>

This is noticeably more verbose than the hand-tuned XML -- why is that? The hand-tuned one (which ONLY understands FlowBoxes and Wires) is designed at a higher conceptual level, by someone who understands the minimum information needed to convey a schematic made of FlowBoxes and Wires, whereas the automatic one simply stores all the objects it finds. The hand one is able to collapse the Wires into integer "input" and "output" attributes directly on the FlowBox.

In addition, the automatic one contains qualified names; e.g. "FlowBox.And" rather than simply "And". I felt this was a safer design choice for an automatic tool, as it ensured there was enough information to know what types we are talking about. In practice, the top level element perhaps should also be qualified with the full path to the object; e.g. "MicadoTypes.FlowBox.And", but I didn't want to get into namespace searching, so I instead rely on the code being given a schema parameter "typeof<FlowBox>" to start the ball rolling. That is, a reader of this data needs the type declarations in order to work, and uses Reflection on those types, similar to using an XML Schema.

Each qualified name is either showing "namespace.class" (e.g. "System.Array") or, in the case of a discriminated union, it is showing "basetype.tag" (e.g. "FlowBox.And"). As we shall see, the code uses the passed in type "schema" to know what to expect.
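As a rough illustration of the "basetype.tag" naming, the sketch below derives such a name from a DU instance. It assumes the current FSharp.Reflection API (FSharpValue.GetUnionFields), not the 2008-era Microsoft.FSharp.Reflection.Value module used in the attachment, and "qualifiedName" is a name invented here.

```fsharp
open FSharp.Reflection

type NodeIndex = int

type Wires =
    | Complete
    | Input of NodeIndex
    | Output of NodeIndex
    | Thru of NodeIndex * NodeIndex

type FlowBox =
    | Prim of Wires
    | Extd of Wires * FlowBox
    | And of Wires * FlowBox array

/// Hypothetical helper: "basetype.tag" for a DU instance, e.g. "FlowBox.And".
/// REQUIRE: du is a (non-null) discriminated union instance.
let qualifiedName (du: obj) =
    // GetUnionFields returns the case description plus the field values;
    // only the case description is needed for the name.
    let case, _ = FSharpValue.GetUnionFields(du, du.GetType())
    sprintf "%s.%s" case.DeclaringType.Name case.Name

let n1 = qualifiedName (box (Input 1))
let n2 = qualifiedName (box (And (Output 2, [||])))
let n3 = qualifiedName (box Complete)
printfn "%s %s %s" n1 n2 n3
```

Using only the simple type name keeps the element names short, at the cost of the ambiguity mentioned above when two namespaces define a type with the same name.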

The attributes "v1" and "v2" are names made up by this serializer. The source syntax for a discriminated union (DU) doesn't name the values in each alternative, so I made up short, valid XML names to indicate "value #1", "value #2", etc. NOTE: internally, DUs use longer names, derived from the alternative's tag name, but I saw no reason to use those longer names.
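For comparison, this sketch lists the field names Reflection reports for each alternative next to the "v1", "v2" names the serializer makes up. It assumes a current F# compiler, which (as far as I know) reports unnamed fields as "Item", or "Item1", "Item2" when there are several; the 2008 compiler used different internal names. Both helper functions are invented for this example.

```fsharp
open FSharp.Reflection

type NodeIndex = int

type Wires =
    | Complete
    | Input of NodeIndex
    | Output of NodeIndex
    | Thru of NodeIndex * NodeIndex

/// Hypothetical helper: field names as Reflection reports them for one alternative.
let internalNames (caseName: string) =
    let case =
        FSharpType.GetUnionCases typeof<Wires>
        |> Array.find (fun c -> c.Name = caseName)
    case.GetFields() |> Array.map (fun p -> p.Name)

/// Hypothetical helper: the short names this serializer invents instead.
let serializerNames (caseName: string) =
    internalNames caseName |> Array.mapi (fun i _ -> sprintf "v%d" (i + 1))

printfn "%A / %A" (internalNames "Thru") (serializerNames "Thru")
```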

The <System.Array> XML element describes an array of "FlowBox"s; each element of the F# array becomes one XML child element of that. Here there are two children: array.[0] is a <FlowBox.Prim>, and array.[1] is a <FlowBox.Extd>. The <FlowBox.Extd> contains a <Wires.Thru> and a <FlowBox.Prim>.

Note that this <FlowBox.Prim> with 'Wires.Input v1="1"' appeared earlier in the data; this serializer has no mechanism for referring to shared data. The design should be extended with such a mechanism, for any large-scale use. The consequence of the limitation is that when data is read back in (deserialized), there will be TWO such objects created; they will no longer share a common identity. In addition, if a shared object was the root of a large network of objects, the entire network would be duplicated at each reference. Worst of all, any recursive data reference would lead to an infinite loop constructing the XML document tree. Fortunately, many practical situations can live with these limitations, at least in a first version.
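The loss of shared identity can be shown without any XML at all. In the sketch below, "rebuild" is a made-up stand-in for a serialize-then-deserialize round trip: it reconstructs every node from its parts, which is all a reader of the XML could do.

```fsharp
type NodeIndex = int

type Wires =
    | Complete
    | Input of NodeIndex
    | Output of NodeIndex
    | Thru of NodeIndex * NodeIndex

type FlowBox =
    | Prim of Wires
    | Extd of Wires * FlowBox
    | And of Wires * FlowBox array

/// Hypothetical stand-in for a round trip: every node is constructed afresh.
let rec rebuild fb =
    match fb with
    | Prim w -> Prim w
    | Extd (w, inner) -> Extd (w, rebuild inner)
    | And (w, items) -> And (w, Array.map rebuild items)

/// True when array.[0] and the box nested inside array.[1] are the same object.
let sharesPrim fb =
    match fb with
    | And (_, [| first; Extd (_, second) |]) -> obj.ReferenceEquals(first, second)
    | _ -> false

let fb1 = Prim (Input 1)
let original = And (Output 2, [| fb1; Extd (Thru (1, 2), fb1) |])
let roundTripped = rebuild original

let sharedBefore = sharesPrim original      // one fb1 object, referenced twice
let sharedAfter = sharesPrim roundTripped   // two separate but equal objects
let sameValue = (roundTripped = original)   // structural equality still holds
printfn "%b %b %b" sharedBefore sharedAfter sameValue
```

For immutable data compared structurally, the duplication mostly costs memory and file size; it matters more once something depends on reference identity.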

By ToolmakerSteve on 4/1/2008 8:43 PM (permalink)

Let's start with a simpler test case; "w1" in the test data of the second post:

let w1 = I 1

== console output =>
Input 1 = w1

== XML output =>
<Wires.Input v1="1" />

What code is sufficient to output this structure, working only from the Reflection data of "typeof<Wires>"?

To refresh our memory of the type definitions:

type NodeIndex = int

type Wires =
    Complete 
  | Input of NodeIndex    // Input Terminal with wire coming out.
  | Output of NodeIndex   // Output terminal with wire coming in.
  | Thru of NodeIndex * NodeIndex

"NodeIndex" is simply an alias for "int", which is an alias for "System.Int32"; internally this is the same as:

type Wires =
    Complete 
  | Input of System.Int32
  | Output of System.Int32
  | Thru of System.Int32 * System.Int32
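A quick way to confirm that the alias leaves no trace at runtime (a small sketch; the value names are only for illustration):

```fsharp
type NodeIndex = int

// The abbreviation is erased: both names denote the same System.Type.
let aliasIsInt32 = (typeof<NodeIndex> = typeof<System.Int32>)

// A boxed NodeIndex value reports its runtime type as System.Int32.
let runtimeName = (box (1: NodeIndex)).GetType().FullName

printfn "%b %s" aliasIsInt32 runtimeName
```

This is why the serializer never sees "NodeIndex": Reflection only ever reports System.Int32.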

I know that the top-level entity is a DU (discriminated union), and that what is wanted is an XmlDocument from namespace System.Xml, [link:msdn2.microsoft.com]
All the XML types we will work with are from that namespace.
We will use Reflection, both the general .NET Reflection as well as F# extensions.

For clarity, I define short aliases to those namespaces (rather than simply using "open" on each):
module SR = System.Reflection
module X = System.Xml
module FR = Microsoft.FSharp.Reflection
module FV = Microsoft.FSharp.Reflection.Value

Given that, here is the specification of the function at the top of the translation to XML:
/// Build a document with du instance as its main element.
let buildDocForDU (rootDu) :X.XmlDocument = ...

What will this function do? This function creates an XML Document, and then creates an XML Element to be the root element in that document. Since we know that a DU might reference other DUs, the DU => XML Element creation is subcontracted to a (recursive) function. A natural design recurses down to the smallest detail, which then passes back its result, which gets aggregated into larger results as passed back up the function stack. One wrinkle: XML Elements and other XML Nodes can only be created (in Microsoft's implementation) by the containing XML Document, so we have to pass a "doc" parameter down through all the functions. I've said a lot; here is the resultant code:
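To illustrate the "document creates the nodes" wrinkle on its own, here is the w1 element built by hand, with no Reflection involved. This is only a sketch of the System.Xml calls, not the code from the attachment.

```fsharp
open System.Xml

// The document is the factory: elements are created by it, then attached.
let doc = XmlDocument()
let el = doc.CreateElement "Wires.Input"
el.SetAttribute("v1", "1")
doc.AppendChild el |> ignore

let xml = doc.OuterXml
printfn "%s" xml
```

Because every node has to come from "doc", any function that builds part of the tree needs "doc" as a parameter, which is why it appears in every signature that follows.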

  /// Build an XmlNode for a discriminated union instance.
  /// REQUIRE: du is a discriminated union instance.
  /// du has 'tag' and 'values'.
  /// Make element name from baseName and tag name.
  let rec du2Xml (doc:X.XmlDocument) (du:obj) givenName :X.XmlElement = ...

  /// Build a document with du instance as its main element. 
  let buildDocForDU (rootDu) :X.XmlDocument =
    let doc = new X.XmlDocument()
    doc.AppendChild (du2Xml doc (rootDu :> obj) "") |> ignore
    doc

I am deliberately showing only the contract of each method the first time it appears, to clarify the result of each implementation decision. This also emphasizes my rationale for including type information on most parameters, even when that is not strictly required. I believe that it is easier for human readers if they can determine what the author intends simply by looking at a function's contract, without needing to examine the body that implements that contract.

Two additional details need explanation. First, "givenName" got added later in the design, once I got two levels deep into the tree. Typically, an object is referenced by some (field or property) name in its parent; in XML it turned out to work most smoothly to keep that name with the data being referred to. We'll see this in use shortly.

Second detail, the need to cast "rootDu" to "obj". There is no base "DU" class (like there is an "Array" class for arrays), so in order to talk about instances of any DU, we need to be talking about "obj"s.
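The difference between arrays and DUs shows up directly in Reflection. The sketch below assumes the current FSharp.Reflection API; "isDu" is a name invented for the example.

```fsharp
open System
open FSharp.Reflection

type NodeIndex = int

type Wires =
    | Complete
    | Input of NodeIndex
    | Output of NodeIndex
    | Thru of NodeIndex * NodeIndex

// Arrays share the System.Array base class; a DU derives straight from obj.
let arrayBase = typeof<int[]>.BaseType
let wiresBase = typeof<Wires>.BaseType

/// Hypothetical helper: with no common base class to test against,
/// "is this a DU?" has to be asked of the runtime type.
let isDu (value: obj) = FSharpType.IsUnion (value.GetType())

printfn "%s %s %b" arrayBase.Name wiresBase.Name (isDu (box (Input 1)))
```

Note that F# lists and options are themselves DUs, so a test like this would also accept them.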

There is also something "absent" compared to a normal XML translator -- I didn't pass in a schema object. Here, the schema is the object's type, and will be obtained when needed by "ob.GetType()".

NOTE: I don't like having to put function declarations before their use. In C#, I would put the "most important" (and first needed by human reader) function "buildDocForDU" first, with "du2Xml" below it.

By ToolmakerSteve on 4/1/2008 10:05 PM (permalink)

I wanted to start you into the code slowly; now I pick up the pace. Hey, it's a forum thread -- if I move too fast, you can post a question!

Now might be a good time to find the link at the top of the first post, and download the whole code sample, so you can peruse it. The section I am describing starts with:

/// -----------------------------------------------
/// ---------- Build XML tree from data. ----------
module FObSerialize =

The main set of functions are recursively grouped:

  /// Attach Xml representation of ob to parent.
  /// "forceChild" is true if ob is to be represented as a childNode,
  /// even when simple.  E.g. "<int v="123" />" rather than 'v1="123"'
  let rec attachXml (doc:X.XmlDocument) (parent:X.XmlElement) (ob:obj) attrName forceChild = ...

  and attachValueAsChild (doc:X.XmlDocument) (parent:X.XmlElement) (ob:obj) = ...
    
  /// Build an XmlNode for a discriminated union instance.
  /// REQUIRE: du is a discriminated union instance.
  /// du has 'tag' and 'values'.
  /// Make element name from baseName and tag name.
  and du2Xml (doc:X.XmlDocument) (du:obj) givenName :X.XmlElement = ...
  
  /// REQUIRE: value is an array.
  and array2Xml (doc:X.XmlDocument) (ar:obj) givenName :X.XmlElement = ...
  
  /// REQUIRE: FR.Type.GetInfo(value.GetType()) = FR.ObjectType
  and netOb2Xml (doc:X.XmlDocument) (givenName:string) (value:obj) nameOptional :XmlSubEntity = ...
 
  /// REQUIRE: value is an instance of some F# type.
  /// "nameOptional" is true for entities such as d.u. that constructed
  /// arbitrary names simply to keep track of attributes.
  and fOb2XmlSubEntity (doc:X.XmlDocument) (givenName:string) (value:obj) nameOptional :XmlSubEntity = ...

"du2Xml", "array2Xml", and "netOb2Xml" handle the three kinds of entities that this version understands (other types could easily be added): DUs (discriminated unions), arrays, and "other .NET types" (currently only the System.Int32 used in Wires). "attachXml" and "attachValueAsChild" factor out some common code. "fOb2XmlSubEntity" is a switchboard -- for each child to be added to any entity, it determines which type is being dealt with, calling the corresponding "...2Xml" function.
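For readers without the attachment, here is a compressed sketch of the same three-way split, folded into one recursive function. It is NOT the attached code: it assumes the current FSharp.Reflection API, the name "toXml" is invented, and it ignores "givenName", "forceChild", null values and cyclic data.

```fsharp
open System
open System.Xml
open FSharp.Reflection

type NodeIndex = int

type Wires =
    | Complete
    | Input of NodeIndex
    | Output of NodeIndex
    | Thru of NodeIndex * NodeIndex

type FlowBox =
    | Prim of Wires
    | Extd of Wires * FlowBox
    | And of Wires * FlowBox array

/// Hypothetical stand-in for the du2Xml / array2Xml / netOb2Xml trio.
/// REQUIRE: value is non-null and contains no cyclic references.
let rec toXml (doc: XmlDocument) (value: obj) : XmlElement =
    let t = value.GetType()
    if t.IsArray then
        // Arrays: one child element per array item.
        let el = doc.CreateElement "System.Array"
        el.SetAttribute("elemType", t.GetElementType().Name)
        for item in (value :?> Array) do
            el.AppendChild(toXml doc item) |> ignore
        el
    elif FSharpType.IsUnion t then
        // DUs: element named "basetype.tag"; simple fields become
        // attributes v1, v2, ...; anything else becomes a child element.
        let case, fields = FSharpValue.GetUnionFields(value, t)
        let el = doc.CreateElement(sprintf "%s.%s" case.DeclaringType.Name case.Name)
        fields |> Array.iteri (fun i field ->
            let ft = field.GetType()
            if ft.IsPrimitive || ft = typeof<string> then
                el.SetAttribute(sprintf "v%d" (i + 1), string field)
            else
                el.AppendChild(toXml doc field) |> ignore)
        el
    else
        // Other .NET values: element named after the type, value in "v".
        let el = doc.CreateElement t.Name
        el.SetAttribute("v", string value)
        el

let fb1 = Prim (Input 1)
let fb3 = And (Output 2, [| fb1; Extd (Thru (1, 2), fb1) |])

let doc = XmlDocument()
doc.AppendChild(toXml doc (box fb3)) |> ignore
let fb3Xml = doc.OuterXml
printfn "%s" fb3Xml
```

Apart from indentation, the result for fb3 should match the preview XML shown in the earlier post. The real code splits this into separate functions so that each kind of entity has its own contract.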

As mentioned previously, considering first the simple case of w1 ==> <Wires.Input v1="1" />, what is the path taken through the code?

TO BE CONTINUED...

By ToolmakerSteve on 4/1/2008 10:37 PM (permalink)

Hey Steve - Great thread!

I too have been thinking a lot about XML in F#. Lately I've been wondering if a general purpose F# library for reading and writing XML might be useful. It would be a doddle to produce an F# style definition of XML:

type Xml =
  | Element of string * (string * string) * Xml // name * attributes * inner xml
  | Text of string
  | Comment of string 
  | ProcessingInstruction of string
  | Empty

(Except this doesn't really address the namespace issue - which adds a huge complication)

And it would be easy to provide functions that read and write from this structure, using one of .NET's many existing XML libraries.
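As a sketch of what the "write" half might look like, here is a variation of that type converted to System.Xml nodes. The variation is an assumption made for the example: attributes and children are lists (so siblings need no special case), and the ProcessingInstruction and Empty cases are left out. "toNode" and "render" are invented names.

```fsharp
open System.Xml

// Assumed variation on the type above, for illustration only.
type Xml =
    | Element of string * (string * string) list * Xml list // name * attributes * children
    | Text of string
    | Comment of string

/// Hypothetical writer: convert the F# structure into nodes owned by doc.
let rec toNode (doc: XmlDocument) (x: Xml) : XmlNode =
    match x with
    | Element (name, attrs, children) ->
        let el = doc.CreateElement name
        for (k, v) in attrs do
            el.SetAttribute(k, v)
        for c in children do
            el.AppendChild(toNode doc c) |> ignore
        el :> XmlNode
    | Text s -> doc.CreateTextNode s :> XmlNode
    | Comment s -> doc.CreateComment s :> XmlNode

/// Hypothetical convenience wrapper: render one tree as an XML string.
let render (x: Xml) =
    let doc = XmlDocument()
    doc.AppendChild(toNode doc x) |> ignore
    doc.OuterXml

let sample =
    Element ("config", [ "name", "demo" ],
             [ Element ("item", [], [ Text "one" ]); Comment "note" ])
let rendered = render sample
printfn "%s" rendered
```

XML namespaces are ignored here, which is exactly the complication mentioned above.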

Would such a library be useful for these kinds of projects?

I'd also be interested in automatically generating libraries for writing XML from an XML schema. It seems to me there's an awful lot of data stored in XML configuration files, and it occurred to me recently that it would be nice to be able to have an F# definition of these files, to remove some of the verbosity and repetitiveness that XML config files typically have.

I'd be hoping eventually to publish this library to CodePlex or SourceForge; would you be interested in joining forces?

Cheers,
Rob

By Robert on 4/5/2008 11:04 PM (permalink)

(Except this doesn't really address the namespace issue - which adds a huge complication)

What namespace issue are you referring to? Can't you deal with whatever the issue is simply by defining short nickname(s) for the namespace(s) in question? E.g.

module X = System.Xml

Or are you saying the problem is that you need a different namespace for each XML Schema you are working with, and that will be hard to code something general for?

One hack might be to have a "standard" nickname that you will define to point to a namespace holding the type definitions for the Schema you are working with. e.g.

module XS = Myschema

Or if you are referring to a different kind of namespace issue, sometimes the solution is to define a set of functions in file A with an "open" into namespace B, so that those functions can use B freely. Then in file C you open "A" (but not B), so that you get indirect use of what you need in B, without actually referring to B from C. A acts as an adapter, giving C just what it needs.

There is an interesting example of this in the FParsec sources, where there is a parser combinator named "string" which one uses in a parser file, instead of the System.String's "string" alias. Why? Maybe this was a standard combinator name from the original Parsec in Haskell. Or maybe because in a parser, one is always wrapping return values into a Reply<_,state> structure, and whenever you want a string in the parser, what you really mean is you want a "Reply<System.String,state>". OK, I'm talking through my hat here, but the override got me thinking...

By ToolmakerSteve on 4/6/2008 12:12 AM (permalink)

I was actually referring to XML namespaces. I'm not an expert on them, but they seem to add a huge amount of complexity when using them through the XML DOM. I do not think .NET namespaces will be an issue.

Also, thinking about it, you'd probably need an ElementList as well, to represent nodes that are siblings:

type Xml =
  | Element of string * (string * string) * Xml // name * attributes * inner xml
  | ElementList of list<Xml> 
  | Text of string
  | Comment of string 
  | ProcessingInstruction of string
  | Empty

Cheers,
Rob

By Robert on 4/6/2008 2:38 AM (permalink)

Actually something even cooler would be the XML type could just be an active pattern over the System.Xml.XmlReader, which would be the same thing but lazy!
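Something along these lines, perhaps. This is only a rough sketch of the idea: the pattern names are invented, and it classifies the reader's current node, not whole subtrees.

```fsharp
open System.IO
open System.Xml

/// Hypothetical active pattern: classify the node the reader is positioned on.
let (|StartTag|TextNode|EndTag|Ignored|) (r: XmlReader) =
    match r.NodeType with
    | XmlNodeType.Element -> StartTag r.Name
    | XmlNodeType.Text -> TextNode r.Value
    | XmlNodeType.EndElement -> EndTag r.Name
    | _ -> Ignored

/// Lazily describe the nodes of an XML string; nothing is read from the
/// reader until the sequence is enumerated.
let events (xml: string) =
    seq {
        use reader = XmlReader.Create(new StringReader(xml))
        while reader.Read() do
            match reader with
            | StartTag name -> yield sprintf "start %s" name
            | TextNode text -> yield sprintf "text %s" text
            | EndTag name -> yield sprintf "end %s" name
            | Ignored -> ()
    }

let seen = events "<a><b>hi</b></a>" |> List.ofSeq
printfn "%A" seen
```

One caveat: XmlReader reports a self-closing element such as "<b />" as a start with no matching end, so real code would also need to check IsEmptyElement.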

By Robert on 4/5/2008 11:07 PM (permalink)

Yes, Robert, let's make this happen. I've sent you e-mail :)

By ToolmakerSteve on 4/6/2008 12:02 AM (permalink)
