If I understand what you are asking, I would suggest that you use encapsulated transactions and cache each transaction to disk in a binary form. Then, when you restart, you not only know where the process was, but you also know whether you have enough data to complete the transaction and log it as complete, or need to roll it back and log it as failed. The reason for separating the transactions from the log is to handle them more efficiently from a space and processing perspective (binary records are smaller and faster than ASCII), while using the log for informational data unless some sort of debugging is on.
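A minimal sketch of the idea (all names here are hypothetical, not a real API): append each transaction to a compact binary log, and append a commit marker only after the work completes, so a restart can tell finished transactions from ones that need rolling back.

```fsharp
open System.IO

// Hypothetical sketch of the two-phase log described above.
// Phase 1: append the transaction payload in binary form.
let writeTransaction (w : BinaryWriter) (payload : byte[]) =
    w.Write payload.Length   // record header: payload size
    w.Write payload          // the transaction itself, smaller than ASCII
    w.Flush()                // force it to disk before doing the work

// Phase 2: append a marker once the transaction's effects are complete.
// On restart, a trailing record with no marker gets rolled back and
// logged as failed; a marked record gets logged as complete.
let markComplete (w : BinaryWriter) =
    w.Write 0xC1uy           // arbitrary commit-marker byte
    w.Flush()
```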

If I got what you were asking correctly, then yes, the above-mentioned method is a good idea.

By on 9/9/2008 3:02 PM ()

Hi code8,

I don't think I understand completely. Do you mean to say that in addition to what's been outlined so far, you suggest that the host-side IO processing support transactions where possible to make the system even more reliable? If this is what you meant, yes, I think that's a good idea too.

As for your comparison to ASCII - if you're referring to my "1000 nulls" example, then yes, I certainly think that the details of the record/replay mechanism can change - I was just trying my best to focus on the core idea and leave the specifics of the implementation to the implementors.

Thanks!

By on 9/9/2008 3:38 PM ()

Hi JDK,

I'm a bit skeptical that it would be this easy, but then again, I'm by no means an expert.

Does this approach scale? What about repeated actions, or loops? Suppose I'm sending a thousand mails. A million. Wouldn't the log become very large? Wouldn't it take a very long time to "replay" it when you resume?

Can you still use non-IO bound local variables to store an intermediate result? I guess not, since you'd probably need that result to replay your state afterwards.

All this "automatic persistence" business is very tricky. What about open file handles? Sockets? When you replay, the outside world may rect very differently than when you actually did the action. Purity does not help you much since it's a theoretical notion: the world is not actually pure!

I'm not saying what you're trying is impossible, but I do think that, like you say, the OS is not made for this kind of stuff, and it seems to me that is going to leak through to your abstractions no matter what.

Maybe it would be a nice start if you tried to make a domain-specific language on top of WF that deals with the complexities you mention?

just my 2c!

Kurt

By on 9/8/2008 12:49 AM ()

Hi Kurt,

These are good questions.

I believe that this approach does scale to reasonable limits for the types of applications I have in mind. Repeated actions and loops DO work: since the log has been generated from an initial state put through deterministic transformations, the loop takes the same path each time, and the log will have those responses ready to replay. Right? By the same logic, non-IO-bound local <i>values</i> (and mutable <i>variables</i>) would also be replayed and available where needed because, again, those code-defined transformations and the states they came from are all deterministic. It's BECAUSE they'd be pure (by discipline) that we could guarantee that no different information is available upon replay.

And yes, sending 1000 emails, uploading 1000 documents to SharePoint, or assigning 1000 tasks will certainly result in a larger log (though in the case of the emails, maybe only 1000 bytes of nulls).

All IO operations would have to be limited, I'm guessing, to operations that are completely tied up by the workflow host before continuing. That is, no streams would be passed back to user code, only buffers and other fixed structures - arrays of the data read, for instance. The host would offer these IO operations (send an email, use a web service, assign a task, log some reporting data, etc.) and allow admins to add additional trusted IO operations. Also, the guest workflows can be composed to build up complex, reusable, and still pure workflows for things like requesting that the host send a certain format of email from a company-specific template.

Does that clear up some of the questions?

Thanks,
JDK
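To make the replay idea concrete, here's a minimal sketch (the names and shape are mine, purely illustrative) of a host that performs IO live on the first run and serves cached results on replay, so the pure guest code, loops included, takes the same path every time:

```fsharp
open System.Collections.Generic

// Hypothetical sketch: the host owns all IO and logs every result.
type ReplayHost() =
    let log = List<obj>()     // a real host would persist this to disk
    let mutable cursor = 0
    // Run an IO action live, or replay its logged result after a restart.
    member this.Io<'T>(action : unit -> 'T) : 'T =
        if cursor < log.Count then
            let r = unbox<'T> log.[cursor]   // replay: no real IO happens
            cursor <- cursor + 1
            r
        else
            let r = action ()                // live: do the IO, log the result
            log.Add(box r)
            cursor <- cursor + 1
            r
    // Rewind to the start of the log to simulate restarting the process.
    member this.Rewind() = cursor <- 0
```

Because the guest's own transformations are pure, re-running them against the same log reconstructs every local value, including loop state, without repeating any IO.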

By on 9/8/2008 3:38 AM ()

Also, I think the syntax and development of the monad could be simplified if the CTP's experimental custom bindings were explored further and added as part of the language. The custom bindings seem to pick up ANY word with '!' after it.

So instead of parsing out the intent of the request for the host from "let! SendEmailAsync ..." we could just say "SendEmailAsync! ...". This would be more natural for the user and less complex for the monad development. But I can't figure out how to even use this experimental feature - there's no documentation that I've found.

And SendEmailAsync and AssignTask may actually turn out to be higher-level pure functions which stitch together lower-level host-supported primitives.

I'm kinda rambling in this reply, but just wanted to make some notes.

By on 9/9/2008 11:16 AM ()

Hi JDK,

The remnants of the "experimental custom binding feature" in the CTP should be ignored and will be deleted in the next release. :-)

[ Aside: For those interested in this bit of archaeology, if you type

   foo! x = 1

then you get a warning related to this. However, the syntax is always treated as "let!". :-) ]

Thanks

Don

By on 9/9/2008 12:29 PM ()

Hi Don,

Thanks for that info.

But you're exactly the guy I'm interested in hearing from - What about the rest of it? Basically - adding a record and replay ability to the async workflow concept.

I was told about a similar project written in Haskell called HAppS, although its focus is on being a web server rather than a BPA engine.

Thanks!

By on 9/9/2008 12:42 PM ()

Hi JDK

It's certainly interesting, and looks feasible.

Note this line is not correct:

        if (Async.Run(result) = Approved) then

Alarm bells should ring if you _ever_ see Async.Run inside a workflow: this is a blocking operation. Instead, unblock the operation by binding:

        let! res = result
        if (res = Approved) then
Don

P.S. Another approach to BPA problems may be to start with a quotation of the workflow, and pre-process it to insert persistence at each binding point. That might allow you to lift the "nothing impure" restriction, and also to optionally translate the entire workflow to one of the BPA object models, or to choose to use other formats to persist the whole workflow. The FSharp.PowerPack.Linq.QuotationEvaluation quotation evaluator would come in useful here, though you should feel free to extend that if needed.
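As a rough illustration of the P.S. (the traversal below is mine and only collects the binding points rather than rewriting them), a quotation can be walked to find every `let`, which is where a preprocessor would splice in the persistence calls:

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns

// Hypothetical sketch: list the names bound at each binding point of a
// quoted workflow. A real preprocessor would rebuild the expression here,
// wrapping each binding in a persistence operation instead.
let rec bindingPoints (e : Expr) : string list =
    match e with
    | Let(v, def, body) ->
        v.Name :: (bindingPoints def @ bindingPoints body)
    | ExprShape.ShapeLambda(_, body) -> bindingPoints body
    | ExprShape.ShapeCombination(_, args) -> List.collect bindingPoints args
    | ExprShape.ShapeVar _ -> []

// e.g. bindingPoints <@ let x = 1 in let y = x + 1 in y @>
```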

By on 9/9/2008 4:22 PM ()

BPA processes are similar to regular programs except that they often do very little computation compared to the length of time that they are considered running, sitting idle while still (and undesirably) consuming fixed resources as OS processes do. These programs may still perform arbitrary computation and interact with the outside world through one-way and two-way communication. Also, a BPA process may need to outlive the OS processes that represent it. BPA is difficult to model well with current technologies because OS processes aren't built for this type of long-running, process-unaware model.

So, a BPA process has the following properties:

- it's a long-running process
- it sits mostly idle, consuming fixed resources in an OS
- from time to time, it interacts with the outside world
- the process serializes its state to disk so that it can be reconstructed on demand whenever the OS process ends.
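As a toy illustration of the last property (the type and file layout are invented here), resuming is just ordinary serialization of whatever state the process needs:

```fsharp
open System.IO
open System.Runtime.Serialization.Formatters.Binary

// Hypothetical sketch: a BPA's resumable state, serialized to disk so the
// OS process can end and the business process can be reconstructed later.
type ProcessState = { Step : int; PendingEmails : string list }

let save (path : string) (state : ProcessState) =
    use s = File.Create path
    BinaryFormatter().Serialize(s, state)   // persist across OS processes

let load (path : string) : ProcessState =
    use s = File.OpenRead path
    BinaryFormatter().Deserialize(s) :?> ProcessState
```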

From my point of view, a messageboard already has those properties. You can think of a thread as the BPA, and the posts in the thread as the stateful information; posts are serialized to disk to persist across process contexts, and de-serialized back into memory whenever a user wants to read the thread. Threads are mostly idle, not interacting with the user too often. Threads persist across application contexts; they can be reconstructed after a server reboot or a power outage.

If we concede that a messageboard thread satisfies the definition of a BPA, then it turns out that writing process-unaware code is a common, mundane programming task in any language.

The BPA you describe, which sends emails to people, is fundamentally no different from the one I described -- as long as you save the state of your process to disk from time to time, you can in principle terminate and reconstruct the process on demand and across application contexts.

Please correct me if I misunderstood you :)

By on 9/6/2008 7:57 PM ()

I don't think you misunderstood me; I think you're being intentionally obtuse.

By on 9/6/2008 8:30 PM ()

I don't think you misunderstood me; I think you're being intentionally obtuse.

Not trying to be obtuse -- at least not intentionally. What you're trying to describe in your opening post just goes over my head (to me it doesn't sound any different from caching data to disk, but evidently you mean something more abstract), so I apologize if my reply wasn't helpful.

By on 9/7/2008 10:55 PM ()

Hi Juliet,

No problem. You're right that I'm looking to cache data to disk, but it's the specific kind of data that makes it a challenge.

As an OS process crawls along a program, stacks and heaps of data are formed in the process's memory space - the process is implicitly tracking the locations, sizes, and values of that data, but such information is not explicitly exposed. I'm making stuff up here, but a debug-mode program could maybe pause and store this info to disk; however, in order to bring it back again - probably at a different location in memory - I think the debugger would have to be really smart, going in and updating ALL the pointers to reflect the new locations. I don't know about that, though. I'm not sure whether the memory manager on the CPU does all the address arithmetic live, storing only relative values in raw memory, or which version of the addresses the debugger would see.

In any case, the amount of data to be saved is probably prohibitively large and is definitely difficult to analyze. Also, this assumes that we map one OS process to one business process instance, since two otherwise unrelated instances could not be practically separated - and that's not going to scale.

Rather than that nightmare, I propose that we decouple the business process from the OS process by considering the business process as a graph of pure transformations. If we restrict ourselves to writing code in this certain style, then it will be much easier to track all the critical pieces of information that would be needed to reconstruct a process's implicit state if the in-memory state were lost. Although the raw data in memory may be different upon reconstruction, logically, at our level of abstraction, the data is identical. Given any initial state and any set of deterministic, pure transformations, we arrive at the same outcome. Now, along the way, we may request that certain non-deterministic operations be performed on our behalf - the results of these IO operations are cached along with the initial state and are fed back to us during a replay operation. This should all work even if the set of transformations is not known at start-up but is generated along the way - that is, we don't know which path the process will take without running it, and each transformation tells the host which transformation to run next.
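One way to picture the claim above (the types are mine, purely illustrative, and this static version ignores the dynamic-path case the paragraph ends on): treat the process as a pure fold of transformation steps over a state, with the cached IO results fed in alongside. Folding twice over the same inputs necessarily lands in the same state, which is all "reconstruction" has to do.

```fsharp
// Hypothetical sketch: a step is a pure function of the current state
// and the next cached IO result; the process is a fold over the steps.
type Step<'S> = 'S -> string -> 'S

let replay (steps : Step<'S> list) (cachedIo : string list) (initial : 'S) =
    // Deterministic: same steps + same cached results = same final state,
    // no matter how many times we run it or where in memory it lives.
    List.fold2 (fun state step io -> step state io) initial steps cachedIo
```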

This is confusing, I'm sure. I don't even know what I'm talking about - I'm just trying to explain it as best I can and get someone who's better at this stuff to comment on why it's not the BPA miracle drug, show me where it already exists, or start developing a working host.

By on 9/9/2008 11:55 AM ()