The LexBuffer<'pos,'char> interface does not expose buffer_scan_length, so "putting back" the current regexp match is not an option. This is in contrast to OCaml, where a lexer action can manipulate lexbuf.lex_curr_pos for exactly this purpose.
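For comparison, the OCaml-side technique being referred to looks roughly like this (a minimal sketch using OCaml's standard Lexing module; `put_back` is an illustrative name, not a library function):

```ocaml
(* Rewind the last n characters of the current match by moving
   lex_curr_pos backwards; the next rule invocation re-scans them. *)
let put_back (lexbuf : Lexing.lexbuf) (n : int) =
  lexbuf.Lexing.lex_curr_pos <- lexbuf.Lexing.lex_curr_pos - n
```

This works in ocamllex because lex_curr_pos is a public mutable field of Lexing.lexbuf; F#'s LexBuffer keeps the corresponding state private.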

Is there any chance that a future version of the F# lexer exposes an interface for "putting back" (part of) the current regexp match in a lexer action?

Is there maybe any UGLY HACK (tm), other than changing the fslib, which would allow me to access a hidden field of LexBuffer?

Stephan

By on 9/5/2007 2:04 PM ()

The following code shows how to use reflection to manipulate the protected fields in lexbuf in order to put back chars of the current match:

{
// (...)
open Lexing
open System.Reflection

// Reflect over the non-public fields of the lexbuf. This is an UGLY HACK:
// the field names are implementation details and may change between releases.
let scanLengthField =
    typeof<Lexing.lexbuf>.GetField("_buffer_scan_length", BindingFlags.Instance ||| BindingFlags.NonPublic)

let lexemeLengthField =
    typeof<Lexing.lexbuf>.GetField("_lexemeLength", BindingFlags.Instance ||| BindingFlags.NonPublic)

// Rewind the last n characters of the current match.
let putBack lb n =
    scanLengthField.SetValue(lb, (scanLengthField.GetValue(lb) :?> int) - n)
    lexemeLengthField.SetValue(lb, (lexemeLengthField.GetValue(lb) :?> int) - n)
// (...)
}

let text = // (...)
let markup = ['{' '}']

rule token = parse
| text markup { putBack lexbuf 1; TEXT(lexeme lexbuf) } // put back the markup char and return text 
// (...)
By on 9/20/2007 2:12 PM ()

Hi Stephan,

Would it be better to have two different lexer rules for this? You can create separate rules using the "and" keyword and then call them from your rule's action code. The idea is that as soon as you find the first character of your in-between bit, you hop into a new rule that gathers up the characters and then returns them (the lexer supplied with F# uses this technique for parsing comments and strings). I haven't got time to put together a working sample, but a lexer like this would, in pseudocode, look something like:

rule token = parse
 | "<starttag>" { STARTTAG }
 | _            { middleBit (lexeme lexbuf) lexbuf }
and middleBit x = parse
 | _            { middleBit (x + lexeme lexbuf) lexbuf }
 | "<endtag>"   { TEXT x } // accumulated middle bit; the end tag itself still needs handling

Hope that helps,
Rob

By on 9/6/2007 5:50 AM ()

Hi Robert,

thanks for your reply.

Actually I'm already using separate rules and I could need the "put back match" feature for the following kind of setup:

let regex1 = // a more or less complicated regex identifying a markup token
(...)
let non_markup = // any character that cannot be the
                 // first char of regex1,...,regexn

rule token = parse
| regex1 { MARKUP1() }
| regex2 { MARKUP2() }
(...)
| _      { text (lexeme lexbuf) lexbuf }

and text str = parse
| non_markup* { TEXT(str + lexeme lexbuf) }
| _           { (* put the matched character back on lexbuf *) token lexbuf }

My problem is that my "endtag" is not context-independent and may, in a particular context, just be normal text. The above lexer would allow me to parse the text relatively efficiently.

Stephan

By on 9/6/2007 7:47 AM ()

I don't know how to do this in fslex / fsyacc, but in the past I've done something like the following:

1) on the 'entry' token, switch the lexer to a 'text' state.
2) in the text state, recognize any run of characters up to (but not including) F(end), and return it as a token (call it PARTIALTEXT or something like that, for instance), where F(end) is the first character of the 'exit' token.
3) in the text state, recognize strings that match F(end)...L(end) (for instance '<' ... '>') where ... is anything valid for an 'exit' token, not necessarily the correct exit token. If the text matches the exit token for the current 'entry' token, then exit the 'text' state when returning this token. If it doesn't match, return the text as a PARTIALTEXT token.
4) in the text state, return invalid end tokens as PARTIALTEXT tokens (i.e. '<' followed by something not valid for an end token).
5) in the parser, make a rule that combines a sequence of PARTIALTEXT tokens into a single TEXT (or something like that).
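Step 5 might look roughly like this as an fsyacc rule (a sketch only; the names are illustrative, and it assumes PARTIALTEXT carries a string payload):

```
text:
  | PARTIALTEXT        { $1 }        /* a single fragment */
  | text PARTIALTEXT   { $1 + $2 }   /* left-recursive accumulation */
```

Left recursion keeps the parser's stack shallow while concatenating the fragments in order.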

Hope this helps,
Kelly Leahy
Milliman, Inc.

By on 9/17/2007 12:04 PM ()