2.2k
views13
comments

I have a function "genseq param" that generates short sequences and another function "testseq" that calls genseq in a loop and processes its results. I noticed that when I replace genseq so that it writes array instead of generatings sequence and then testseq processes this array, my application runs 10 times faster.

I wrote a simple test, attached below, and indeed the array-based code runs in 1.1 seconds versus 11 sec for the sequence code, if compiled without optimizations. With optimizations the sequence code completes in 10 seconds, while array-based in 0.12 seconds. I tried inlining genseq or replacing tuples with structs, but the array code was still 10 times faster (both in this toy app and in my actual code). Using Struct resulted in the same time as using the sequences.

I find this contradicting to my intuition that sequence code should be FASTER as it uses state machine and does not have memory dependencies.

Is there some way to produce/consume sequences without killing speed due to (it seems to me the problem) constant garbage collection?

------------------------------------------------------------------------------------
open System.Diagnostics

[<Struct>]
type Pair =
val x: int
val y: int
new (x,y)={x=x;y=y}
let inline (|Pair|) (p: Pair) = (p.x, p.y);;

let inline genseq k =
seq{
for i = 0 to k-1 do
yield (i,i+1)}

let testseq n =
let seq = genseq 1000
let mutable maximum = 0
for i = 0 to n do
for (x,y) in seq do
maximum <- max maximum (max x y)
maximum ;;

let a = Array.create 1000 0
let b = Array.create 1000 0

let genseq2 k =
for i = 0 to k-1 do
a.[i] <- i
b.[i] <- i+1 ;;

let testseq2 n =
let mutable maximum = 0
let k = 1000
for i = 0 to n do
genseq2 k
for j =0 to k-1 do
maximum <- max maximum (max a.[j] b.[j])
maximum

let inline genseq3 k =
seq{
for i = 0 to k-1 do
yield new Pair(i,i+1)}

Using Seq.unfold to generate sequence you get 15-20% speedup.

let inline genseq k = Seq.unfold (function i when i < k -> Some ((i,i+1),i+1)|_ -> None) 0;;

By ssp on 7/21/2009 4:31 AM (permalink)

I have not tried compiling your code, but at a glance, I think the thing you are measuring is all allocation. Your 'seq' example is actually a 'allocate tuples' example, whereas you array example allocates nothing (other than the two arrays, allocated once). So you're not really comparing 'seq' versus 'not', you're comparing 'allocating a zillion objects' versus 'not'. It's not that seqs are slow, rather it's that your seq example is doing lots of allocation that the other example does not.

(You said you changed tuples to structs and saw no practical difference; that would surprise me in this particular example, can you share the 'struct' version of the code? I would expect it to be much faster than tuples in this example, due to less allocation.)

(All that said, as Stephan points out, if the 'work in the loop body' is still small enough that a couple of virtual method calls will dominate that work, then I would still expect the array to be much faster.)

By brianmcn on 7/20/2009 12:49 PM (permalink)

Added structure-based code and times to the first post. Results are the same as with the sequence-based code.

I would have hoped that since we have simple objects that are consumed as they are produced, there would be no allocations, only register operations. For example, it's my assumption, doing the same stuff with a sequence of ints could be faster than using array as temporary. I was just looking into a way to use sequences as using global arrays is ugly. Unfortunately this cannot be done by generating two sequences - one for first element, another - for the second - of the tuple.

To make the comparison more fair, I changed the array sequence generator to:
let genseq2 k =
for i = 0 to k-1 do
let p = (i, i+1)
a.[i] <- (fst p)
b.[i] <- (snd p) ;;
and ran without optimizations. So apparently now the object is created in all cases. Still this code runs 5 times faster than the sequence-based code.

By yanov on 7/20/2009 1:25 PM (permalink)

> Added structure-based code and times to the first post. Results are the same as with the sequence-based code.
[Update: see my next post below.] If your iteration uses the |Pair| active pattern, a tuple still gets allocated, because the F# compiler doesn't yet optimize away the tuples constructed for the arguments of active pattern constructors. You can see this if you inspect the generated code using Reflector.

By Stephan on 7/20/2009 1:58 PM (permalink)

Avoiding active patterns (see below) did not change anything. This code runs 11 sec, vs 10.8 sec for the sequence-of-tuples code, i.e. even slower! Is there a case where consuming just produced sequence does not require allocations, just register operations? My code consists of many short filter-like functions consuming and producing sequences, connected in pipelines. I was under wrong impression that this should be more efficient than reading/writing to memory as the compiler knows there are no data dependencies...

-------------------------------------------------------------------------

let testseq3 n =
let seq = genseq3 1000
let mutable maximum = 0
for i = 0 to n do
for el in seq do
maximum <- max maximum (max el.x el.y)
maximum ;;

By yanov on 7/20/2009 5:06 PM (permalink)

> Avoiding active patterns (see below) did not change anything.

I just had a look at the generated code and have to retract my earlier post: F# indeed optimizes away the tuple in this case, which explains the similar performance. I mistakenly inferred from my previous experience with active patterns that a tuple allocation would be the problem here.

> Is there a case where consuming just produced sequence does not require allocations, just register operations.

No.* And btw, differentiating between "allocations" and "register operations" doesn't make too much sense, since any allocation is implemented using register operations and there aren't many computations that can be implemented without at least stack allocations. You probably meant to discuss the contrast between heap and stack allocations.

> My code consists of many short filter-like functions consuming and

producing sequences, connected in pipelines? I was under wrong

impression that this should be more efficient than reading/writing to

memory as the compiler knows there are no data dependencies...

You are probably a little bit too optimistic about the kind of optimizations that the compiler applies. The F# compiler doesn't yet attempt to fuse pipelined sequence operations, and the JIT can't do this kind of optimization either.

* Theoretically the IEnumerator could be implemented as a struct, which would give the compiler the chance to emit more efficient iteration code if it knows the full type of the enumerator in advance. For example, the System.Collections.Generic.List enumerator is actually implemented as a struct and the C# compiler exploits this when possible. However, as soon as you pass the sequence to a function taking an IEnumerable argument, the iteration can no longer be optimized. Considering how sequences are typically used, this makes the whole optimization rather pointless (and F# doesn't support it anyway).

By Stephan on 7/21/2009 12:36 AM (permalink)

As the functions connected in pipelines are tight loops consuming and

producing complex numbers, without any data dependencies, I *hoped*

that in the end everything will run in registers.

My question was not about complaining that my code runs orders of

magnitude slower than I expected. I ask if there is any way to HELP the

compiler understand that heap allocations are not required short of

rewriting the whole application as a single function.

> Avoiding active patterns (see below) did not change anything.
> Is there a case where consuming just produced sequence does not require allocations, just register operations?
No.* And btw, differentiating between "allocations" and "register operations" doesn't make too much sense, since any allocation is implemented using register operations and there aren't many computations that can be implemented without at least stack allocations. You probably meant to discuss the contrast between heap and stack allocations.
> My code consists of many short filter-like functions consuming and
producing sequences, connected in pipelines. I was under wrong
impression that this should be more efficient than reading/writing to
memory as the compiler knows there are no data dependencies...
You are probably a little bit too optimistic about the kind of optimizations that the compiler applies. The F# compiler doesn't yet attempt to fuse pipelined sequence operations, and the JIT can't do this kind of optimization either.
* Theoretically the IEnumerator could be implemented as a struct, which would give the compiler the chance to emit more efficient iteration code if it knows the full type of the enumerator in advance. For example, the System.Collections.Generic.List enumerator is actually implemented as a struct and the C# compiler exploits this when possible. However, as soon as you pass the sequence to a function taking an IEnumerable argument, the iteration can no longer be optimized. Considering how sequences are typically used, this makes the whole optimization rather pointless (and F# doesn't support it anyway).

By yanov on 7/21/2009 7:48 AM (permalink)

No, the current compiler is not capable to transform multiple pipelined sequence operations into a single optimized loop. There is also no way to iterate over a seq without incurring the overhead of the IEnumerable interface.

However, you don't need to "rewrite the whole application as a single function" in order to work with arrays. You can still factor your computations into several independent functions that operate on arrays instead of seqs.

By Stephan on 7/21/2009 8:17 AM (permalink)

This optimization theoretically looks rather simple, at least until one goes into details. To pipeline filter1 and filter2 one can just substitute statement "yield res1" with the code for filter2.

If pipelining simple F# filters is considered idiomatic style, this is a very important optimization.

A little off topic. Does C# compiler optimize this for IEnumerables?

--------------------------------------

    
 let filter1 seq1 =
    seq{
    for el1 in seq1 do
      something1
      yield res1}
      
 let filter2 seq2 =
    seq{
    for el2 in seq2 do
      something2
      yield res2}  
 
 let filter1_pipe_filter2 seq1 =
    seq{
    for el1 in seq1 do
      something1
      let el2 = res1 
      something2
      yield res2}

By yanov on 7/21/2009 1:46 PM (permalink)

It may not be entirely on-topic, but are you dealing with 'faster' or 'fast enough'? What I mean is, are you looking at raw processing deltas, or practical usage times? It doesn't really change the results, but I like to remind people of the trap of pre-mature/unnecessary optimization.

By Paulmichael Blasucci on 7/20/2009 7:09 PM (permalink)

My actual application runs 10 minutes with the sequence-based code and 2 minutes with writing temporary arrays...

It may not be entirely on-topic, but are you dealing with 'faster' or 'fast enough'? What I mean is, are you looking at raw processing deltas, or practical usage times? It doesn't really change the results, but I like to remind people of the trap of pre-mature/unnecessary optimization.

By yanov on 7/20/2009 10:30 PM (permalink)

Seqs are much slower than arrays, as you noticed yourself. Everytime you iterate over a seq an enumerator needs to be created, and then you need two interface calls (MoveNext + Current) for every single item you iterate over. Compare that with a simple indirect memory fetch for an array. I haven't looked at your sample in detail, but I'd guess that the JIT doesn't eliminate the bounds checks. So you could make the array version even faster by coding it such that the JIT recognizes the array iteration and eliminates the bounds checking (for at least one of the arrays), though that is currently difficult to do in F#.

As Brian notes below, the performance of your seq implementation could be significantly improved. However, I'd still expect that a simple iteration over a seq is slower by at least a factor of 10 relative to an optimized array iteration. Hence, if you have a "tight loop" with a relatively small loop body and you need optimal performance, you should prefer an array.

By Stephan on 7/20/2009 12:42 PM (permalink)

Seqs are much slower than arrays, as you noticed yourself. Everytime you iterate over a seq an enumerator needs to be created, and then you need two interface calls (MoveNext + Current) for every single item you iterate over. Compare that with a simple indirect memory fetch for an array. I haven't looked at your sample in detail, but I'd guess that the JIT doesn't eliminate the bounds checks. So you could make the array version even faster by coding it such that the JIT recognizes the array iteration and eliminates the bounds checking (for at least one of the arrays), though that is currently difficult to do in F#.
As Brian notes below, the performance of your seq implementation could be significantly improved. However, I'd still expect that a simple iteration over a seq is slower by at least a factor of 10 relative to an optimized array iteration. Hence, if you have a "tight loop" with a relatively small loop body and you need optimal performance, you should prefer an array.

I would add a little to the above in that the reason that the array code is faster is that it is using direct coding and not functional coding, thus completely avoiding parameter passing and stack manipulations necessary for the indirect method table calls needed for MoveNext() and Current. Once you realize that the extra time is being used by the enumeration, there is no point to writing the values out to arrays and then reading them back as you may as well combine the operations into one pass as follows, which is slightly over twice as fast as the direct array technique:

let testseq n =
  let mutable maximum = 0
  for i = 0 to n do
    for j = 0 to 1000 do
      maximum <- max maximum (max j (j + 1))
  maximum

Basically, the answer to the subject line is that sequences aren't suitable for tight loops as compared to using imperative code due to the overheads of the IEnumerator interface, just as Stephan said. The current F# core library version 2.0.0.0 implementation of sequences is very inefficient due to being written too general and internally doing even more calling of internal methods per iteration loop. If you really want to use sequences, you can make them run a tight loop about six times as fast as the built in range sequence by wrapping your iteration function in your own enumeration as follows:

//enum used internally
type internal State =
  |Reset = 0
  |Running = 1
  |Finished = 2

//the sequence function
let inline doit i = i

//The enclosing implementation of the IEnumerator interface
type sequ(k) =
  let mutable i = 0
  let mutable st = State.Reset
  interface IEnumerator<int*int> with
    member this.Dispose() = ()
    member this.Current with get() = doit i
  interface IEnumerator with
    member this.Reset() =
      i <- 0

      st <- State.Reset
    member this.MoveNext() =
      if st = State.Running then
        i <- i + 1
        if i >= k then
          st <- State.Finished
        true
      elif st = State.Finished then false
      else
        st <- State.Running
        true
    member this.Current with get() = doit i :> obj

//implemented using an object expression
let genseq k =
  { new IEnumerable<int*int> with
      member this.GetEnumerator() = (new sequ(k) :> IEnumerator<int*int>)
    interface IEnumerable with
      member this.GetEnumerator() = ((new sequ(k)) :> IEnumerator)
  }

The above code is very similar to built-in library's Seq.init function but is still many times faster due to the custom optimization of not doing any extra calls. This is still about four times slower than the ultimate imperative coding loop, which is just the problem with functional programming unless the compiler optimizes away the function calls and makes everything inline. The extra indirect calls through the object's method table still eat up about 21 extra clock cycles. C#, which is a little better at optimizing than the current CTP version of F#, reduces the time to run a sequence as implemented above in F# by a little, but the functional call still takes about four times longer than doing something similar not using the enumeration coding.

The Seq.unfold is still many times slower than using the above code.

Using sequences to implement functional code brings up the usual argument of "Why use functional programming?" in that functional programming will always trade off the ultimate in speed for time critical tight loops for elegance of expression and ease of prototyping.

EDIT ADD:

Even the F# version 2.0 CTP released with Visual Studio 2010 still is not very efficient as to running tight loops, primarily due to inefficiencies in the FSharp.Core runtime library. For instance, for reasonably fast loops, F# should be able to run an integer sequence such as {0 .. 1000000000} (one billion) in a reasonable amount of time other than for the IEnumerator overheads of calling the MoveNext() method and (possibly, if the results are required), the Current property. Minimum time including the overheads would be something like 30 clock times per loop, so about 10 seconds on a modern processor for a "do nothing" loop which just say adds them all together. It takes much longer than that, and a little investigation with the Red Gate Reflector program reveals why:

The FSharp.Core library implements the integer IEnumerator sequence by calling the following virtual property from the necessary MoveNext() method as a *virtual* method (calling virtual methods takes longer than calling regular methods), which in turn makes multiple virtual method calls to the Before and Equal virtual methods in order to do it's work. The implementation of Current also makes a virtual call to Result in order to do it's work. The following is in C# format as Reflector does not yet support F# code forms:

public override bool CanStep


{


  get


  {


    TState z = this.z;


    TState local2 = this.Step(this.z);


    if (!this.Before(local2, z))


    {


      if (this.Equal(local2, z))


      {


        return false;


      }


      if (this.Before(z, local2) && this.Before(local2, this.m))


      {


        this.z = local2;


        return true;


      }


      if (this.Equal(local2, this.m))


      {


        this.z = local2;


        return true;


      }


    }


    return false;


  }


}

These extra virtual calls mean that the minimum tight loop that can be run with an integer sequence takes about three times longer than it would without all the (virtual) method calls.

Even functions that have sequences as arguments and return sequences as would be used in "chained" calculations such as Seq.map or Seq.filter use some virtual calls internally in order to do their work.

All of these functions are written for general use without respect to the amount of extra overhead time they may use, which makes F# even more unsuitable for using for tight iterative calculations using the immutable forms of expression even more than would be just to use the IEnumerator inteface directly. This means that the only ways to write these types of loops is to use imperative mutable coding and avoiding the IEnumerable interface implied by "for _ in _ do" loops (fastest) or wrapping the functions in our own IEnumerator interface which does not make virtual calls and which minimizes the number of levels of method calls as in my previous post, but which still takes about three to four times longer than the mutable imperative form.

using seq { for _ to _ do yield ... } forms of sequence should be faster as it implies wrapping your own sequence, but is incredibly slower due to making even more levels of indirect virtual calls.

As an aside, there are also compiler quirks, in that it is recommended that one form sequences with a leading "Seq" as in "Seq { 0 .. 2 .. 25 }" but by using the "seq" there is an additional unnecessary CreateSequence call to the library, as the Range is already a sequence. This Makseq call should be recognized as unnecessary an eliminated by the compiler, not that it make s much difference to speed. Also the shortcut form of "seq { for i in { 0 .. 10 } -> i }" is accepted by the compiler but not "seq { for i = 0 to 10 -> i }" which must be written as "seq { for i = 0 to 10 do yield i }". Signs of unpolished work, and therefore presumably the reason the product still has "CTP" status.

In short, until the first of the following two functions is no slower than about three times than the second instead of the current about ten times slower, the built in IEnumerator's of F# are particularily unsuitable for writing tight loops, in particular when using the immutable forms of enumeration as wrapped in the FSharp.Core runtime library:

let test1 = { 0 .. 1000000000 } |> Seq.length

let test2 =


  let mutable count = 0


  for i = 0 to 1000000000 do


    count <- count + 1


  count

By GordonBGood on 5/10/2010 8:42 PM (permalink)

Topic tags

Built with WebSharper

Home

Answers

Events

Courses

Groups and Conferences

Blogs

Jobs

Developers

Topic tags