There was an interesting discussion on the OCaml mailing list today about implementing a byte code interpreter using a continuation-passing style.

The best implementation of that program is available as an OCaml Journal article with a detailed description of its design and implementation. The next F# Journal article will be along similar lines.

I was wondering if this would be relevant to HDFS logic simulations in F#. Basically the issue I face is that when constructing a simulation one generally ends up with lots of very tiny "tasks". A task is simply a function of type unit -> unit, and a complete simulation consists of an ordered list of these tasks which gets executed with "List.iter (fun x -> x()) tasks". In some cases a task can be quite complex (say, evaluation of a 128x128-bit multiply), but in most cases it'll be something very simple like ((a.[0]+b.[0]) &&& mask). My gut feeling (which could well be totally off) is that in most simulations as much time is being spent in the overhead of iterating through the task list and calling the function as in actually performing the operation itself.
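To make that concrete, here is a minimal sketch of the shape of such a simulation (the result register c and the concrete values are invented for illustration, not HDFS code):

    // Simulation state: a few "registers" held in mutable arrays.
    let a = [| 0x1234u |]
    let b = [| 0x0001u |]
    let c = [| 0u |]
    let mask = 0xFFFFu

    // Each task is a tiny side-effecting thunk of type unit -> unit.
    let tasks : (unit -> unit) list =
        [ fun () -> c.[0] <- (a.[0] + b.[0]) &&& mask ]

    // One clock cycle of the simulation: run every task in order.
    let step () = List.iter (fun x -> x ()) tasks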

The benefit of transformation to CPS is the ability to hoist computations from the inner loop and replace them with indirections. The way you explain your problem, it sounds as though you've already hoisted everything that can be hoisted. In which case, you'll just pay the price of F#'s slow tail calls when invoking one continuation from the next, and CPS will probably end up making your program slower.
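For what it's worth, here is a sketch of what that transformation could look like for the task list in the sketch above (the compile function and its names are invented for illustration): each task is baked into a closure that calls the next, so the List.iter loop disappears but every step becomes a call from one continuation into the next.

    // Pre-compose the tasks into one chain of closures, built once up front.
    // Running a simulation step is then a single call into the chain; the cost
    // moves from list iteration to one (tail) call per task.
    let rec compile (tasks : (unit -> unit) list) (finish : unit -> unit) : unit -> unit =
        match tasks with
        | [] -> finish
        | task :: rest ->
            let next = compile rest finish
            fun () -> task (); next ()

    let step = compile tasks (fun () -> ())   // call step () once per clock cycle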

Is there some way in which a CPS style would alleviate this overhead? I can see that tail recursion can be used, but I would expect List.iter to already do that (though perhaps with a small amount of pointer arithmetic). How about the function call overhead?

CPS is less likely to help in F# but it sounds as though it would not help in your case anyway.

I must admit I am looking for a magic bullet here as I realize the "proper" solution to this would be to generate and compile byte code directly.

Yes. Fortunately, this is much easier with F# thanks to .NET's built-in IL generation or LINQ. There are sections on these topics in F# for Scientists, BTW.
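As a rough sketch of the System.Reflection.Emit route (the method and the toy (a + b) &&& mask operation are invented for illustration, not taken from HDFS or the book), one can emit a specialised body at runtime and get back a delegate that the JIT turns into machine code:

    open System
    open System.Reflection.Emit

    // Emit a method computing (a.[0] + b.[0]) &&& 0xFFFF for two uint32 arrays.
    let makeAddTask () =
        let dm =
            DynamicMethod("addTask", typeof<uint32>,
                          [| typeof<uint32[]>; typeof<uint32[]> |])
        let il = dm.GetILGenerator()
        il.Emit(OpCodes.Ldarg_0)          // load array a
        il.Emit(OpCodes.Ldc_I4_0)
        il.Emit(OpCodes.Ldelem_U4)        // a.[0]
        il.Emit(OpCodes.Ldarg_1)          // load array b
        il.Emit(OpCodes.Ldc_I4_0)
        il.Emit(OpCodes.Ldelem_U4)        // b.[0]
        il.Emit(OpCodes.Add)
        il.Emit(OpCodes.Ldc_I4, 0xFFFF)   // mask
        il.Emit(OpCodes.And)
        il.Emit(OpCodes.Ret)
        let del = dm.CreateDelegate(typeof<Func<uint32[], uint32[], uint32>>)
        del :?> Func<uint32[], uint32[], uint32>

    let addTask = makeAddTask ()
    let r = addTask.Invoke([| 0x1234u |], [| 1u |])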

I am also trying to encourage the F# team to add better support for metaprogramming. After all, the F# compiler itself is a great IL code generator... :-)

By on 8/29/2007 3:14 AM ()

List.iter is tail recursive.

Array.iter would be faster.
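For example (a sketch, reusing the tasks list from the original post): freeze the list into an array once, then iterate over the array each cycle.

    // Convert the task list to an array once, then iterate the array per cycle.
    let taskArray = List.toArray tasks
    let step () = Array.iter (fun task -> task ()) taskArray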

RE: How about the function call overhead?

Creating a closure corresponds to creating an object descending from a FastFunc* class. Calling it will be a (virtual) method call. Reflector will show what is happening.
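Roughly, a closure like the ((a.[0] + b.[0]) &&& mask) task compiles down to something like the hand-written approximation below (illustration only; the class name is invented, and FastFunc was renamed FSharpFunc in later F# releases):

    // Approximation of what the compiler generates for a closure: the captured
    // variables become constructor arguments/fields and the body lives in a
    // virtual Invoke method.
    type AddTask(a : uint32[], b : uint32[], c : uint32[], mask : uint32) =
        inherit FSharpFunc<unit, unit>()
        override this.Invoke(_arg) = c.[0] <- (a.[0] + b.[0]) &&& mask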

You can profile for allocation and speed with the CLRProfiler and Visual Studio sampling.

I also wondered about "ordered list".

Given your insert/consume pattern, could another datastructure give improvements?

One option is System.Collections.Generic.LinkedList.

By on 8/15/2007 4:09 PM ()

List.iter is tail recursive.
Array.iter would be faster.

Good suggestion. It should be trivial to see how much difference that makes.

RE: How about the function call overhead?
Creating a closure corresponds to creating an object descending from a FastFunc* class. Calling it will be a (virtual) method call. Reflector will show what is happening.

I'm not well versed in .NET's JIT compiler, so I don't really know what overhead that would imply. Would it basically resolve down to a machine-level function call, perhaps with a vtable-like lookup? Relatively speaking, that's still a big overhead compared to the actual processing going on in most of my "tasks".

RE: You can profile for allocation and speed with the CLRProfiler and Visual Studio sampling. I also wondered about "ordered list". Given your insert/consume pattern, could another datastructure give improvements? One option is System.Collections.Generic.LinkedList.

I think I explained that badly. I start off with a cyclic graph - each node in the graph corresponds to a "task" function. There are a couple of different types of node in the graph which imply there must be a way to order them linearly, otherwise the circuit is not valid (in hardware parlance, it contains combinatorial loops). This is all resolved when the task list is generated. After that it's just a case of running the task list over and over again (each iteration corresponds to one hardware clock cycle).

The impression I got was that CPS was a way of flattening out what was really just a sequence of instructions, albeit generated dynamically. On the assumption that the function call overhead is potentially relatively big, I thought that might make a significant difference. It's a new idea to me though, so I probably don't understand it properly.

Thanks for the response,

Cheers, Andy.

By on 8/15/2007 5:28 PM ()