
I have to read some text from a TextReader and convert it into a sequence of tokens. I want to read lazily from the reader and generate tokens on request. To do this I created a type implementing the IEnumerable and IEnumerator interfaces:

 

    open System
    open System.Collections
    open System.Collections.Generic
    open System.IO

    type internal Tokenizer (template:TextReader) =
        let mutable current: Token = Start
        
        interface IEnumerator<Token> with
            member this.Current = current
        
        interface IEnumerator with
            member this.Current = current :> obj
            // Each MoveNext reads one more line from the underlying reader.
            member this.MoveNext() =
                match template.ReadLine() with
                | null -> 
                    false
                | line -> 
                    current <- Text (new TextToken(line + "\r\n", 0, 0))
                    true
                
            member this.Reset() = failwith "Reset is not supported by Tokenizer"
        
        // The object is its own enumerator, so it can only be enumerated once.
        interface IEnumerable<Token> with
            member this.GetEnumerator():IEnumerator<Token> =
                this :> IEnumerator<Token>

        interface IEnumerable with
            member this.GetEnumerator():IEnumerator =
                this :> IEnumerator

        // Disposing the enumerator also closes the reader.
        interface IDisposable with
            member this.Dispose() = template.Dispose()

The behavior of this class baffles me: I see calls coming to the MoveNext method, but there are no calls to the Current properties. Also, when I check whether the sequence is empty (Seq.is_empty), it calls MoveNext again and after that calls Dispose on the object. Did I confuse it by implementing both IEnumerable and IEnumerator on the same class? I only use this sequence once, and if the runtime tries to create another enumerator over the same IEnumerable, I am in trouble no matter what I do, because the underlying 'sequence' is the TextReader, which cannot be 'reset'.

What am I missing here?

The way you have the class now, you can only safely get an enumerator once - anyone who uses an enumerator is bound to Dispose() it when they're done, at which point this object is dead. So this class is not too useful, since its legal-use-pattern is unexpected.

If you want an IEnumerable that's reusable (e.g. can call GetEnumerator on it more than once), your best (only?) bet is to buffer. One simple implementation is to use this class as an implementation detail of a function that takes a TextReader and returns an IEnumerable<Token>. The function body would be like

    new Tokenizer(reader) |> Seq.cache

where Seq.cache does the buffering.
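A minimal sketch of such a function follows. To keep it self-contained it does not use the Tokenizer class from the question; it swaps in Seq.unfold over the reader instead. The Token type here is a simplified, hypothetical stand-in, and the names readTokens and tokenize are made up for illustration.

```fsharp
open System.IO

// Hypothetical, simplified stand-in for the Token type in the question.
type Token =
    | Start
    | Text of string

// Single-use source: each step pulls one more line from the reader.
let readTokens (reader: TextReader) : seq<Token> =
    Seq.unfold (fun () ->
        match reader.ReadLine() with
        | null -> None
        | line -> Some (Text (line + "\r\n"), ())) ()

// Seq.cache remembers tokens that were already produced, so callers
// may request an enumerator more than once without losing tokens.
let tokenize (reader: TextReader) : seq<Token> =
    readTokens reader |> Seq.cache
```

With this shape, something like `let tokens = tokenize reader` can be checked with Seq.isEmpty and then iterated, and the first token is still there the second time.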

By brianmcn on 5/9/2009 5:25 PM (permalink)

The way you have the class now, you can only safely get an enumerator once ...

This is exactly my intention. I only use the sequence once. At least I only use it once explicitly. It seems, though, that Seq.is_empty creates another enumerator over the same enumerable. If that were the case, it would explain the behavior I see.

If you want an IEnumerable that's reusable (e.g. can call GetEnumerator on it more than once), your best (only?) bet is to buffer. One simple implementation is to use this class as an implementation detail of a function that takes a TextReader and returns an IEnumerable<Token>. The function body would be like
new Tokenizer(reader) |> Seq.cache
where Seq.cache does the buffering.

But would Seq.cache read the sequence to the end to cache it? This would defeat the purpose of the refactoring I am doing. I had this code working with the entire file read into a string and then working off the string. I wanted to change it to avoid reading the entire file into memory only to discard it after I tokenized it.

By mfeingold on 5/9/2009 10:18 PM (permalink)

This is exactly my intention. I only use the sequence once. At least I only use it once explicitly. It seems, though, that Seq.is_empty creates another enumerator over the same enumerable. If that were the case, it would explain the behavior I see.

Yes, of course. Seq.is_empty, and every other function in the universe that operates on seqs calls GetEnumerator(). That's the only function there is on IEnumerable! If you want to know anything about the contents of an IEnumerable, you must call GetEnumerator and then MoveNext/Current.
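As a rough illustration, an emptiness check behaves like the sketch below (Seq.isEmpty is the current name of Seq.is_empty). This is not the library source: isEmptySketch, lines, wasEmpty and rest are made-up names. It shows why you see MoveNext followed by Dispose with no call to Current, and why a single-use source loses an item to the check.

```fsharp
open System.IO

// Roughly what an emptiness check has to do with any seq.
let isEmptySketch (source: seq<'T>) : bool =
    use e = source.GetEnumerator()   // Dispose() runs when e goes out of scope
    not (e.MoveNext())               // Current is never read

// A single-use sequence backed by a reader, as in the question.
let reader = new StringReader("first\nsecond")
let lines =
    Seq.unfold (fun () ->
        match reader.ReadLine() with
        | null -> None
        | line -> Some (line, ())) ()

let wasEmpty = isEmptySketch lines   // MoveNext consumes "first" from the reader
let rest = Seq.toList lines          // a second enumerator sees only what is left
```

Here rest no longer contains "first", because the check already pulled it from the reader.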

But would Seq.cache read the sequence to the end to cache it? This would defeat the purpose of the refactoring I am doing. I had this code working with the entire file read into a string and then working off the string. I wanted to change it to avoid reading the entire file into memory only to discard it after I tokenized it.

Try it. Seq.cache does exactly what you want.
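One quick way to try it, as a sketch: count how many items the underlying sequence actually produces. The counter and the 1 .. 1000 range are made up for the experiment.

```fsharp
// Counts how many items the underlying sequence has produced so far.
let reads = ref 0

let source =
    seq {
        for i in 1 .. 1000 do
            reads.Value <- reads.Value + 1
            yield i
    }

let cached = Seq.cache source

// Asking for the first item pulls from the source only as far as needed.
let first = Seq.head cached
let readsAfterHead = reads.Value

// A second pass replays the cached prefix before pulling anything new.
let firstThree = cached |> Seq.take 3 |> Seq.toList
let readsAfterTake = reads.Value
```

Printing readsAfterHead and readsAfterTake should show numbers far below 1000, which means the cache fills on demand instead of reading to the end.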

[link:research.microsoft.com]

By brianmcn on 5/9/2009 10:49 PM (permalink)

But would Seq.cache read the sequence to the end to cache it? This would defeat the purpose of the refactoring I am doing. I had this code working with the entire file read into a string and then working off the string. I wanted to change it to avoid reading the entire file into memory only to discard it after I tokenized it.
Try it. Seq.cache does exactly what you want.
[link:research.microsoft.com]

Ha, I see. For some reason I expected the Seq.cache functionality to be already built into every seq.

By mfeingold on 5/10/2009 7:10 AM (permalink)
