Well, matching the full text with a regex should definitely work, but you need to get the regex right of course (which at the moment it isn't). Depending on the situation it might be more appropriate to go through the text in chunks (e.g. lines), but for most HTML-related tasks there is no need to bother.

Code could look something like this:

Regex.Matches("input", "[aeiouy]")
|> Seq.cast<Match>
|> Seq.map (fun m -> m.Value)
|> List.of_seq;;

val it : string list = ["i"; "u"]

By on 1/26/2009 9:53 PM ()

Maybe with this regex?

let linksFromHtml html =
  let regex = "http://www.nba.com/games/\\d{8}/([^/]+)/gameinfo.html"
  Regex.Matches(html, regex)
  |> Seq.cast<Match>
  |> Seq.map (fun m -> m.Groups.[1].Value, m.Value)
  |> List.of_seq

BTW: it's better to label regex groups explicitly, but I still don't get how to put the < and > in my code without having them disappear.

By on 1/26/2009 10:06 PM ()

deleted ... missed your Seq.Cast

By on 1/26/2009 10:32 PM ()

Hi all!

Sorry I haven't answered earlier, I have been QUITE busy the past weeks.

Going back to the topic... I still can't get it. If you remember, the goal was to get the games listed on the NBA web page, which each day has a URL like this: [link:www.nba.com]

The code so far is this:
#light

open System.IO
open System.Net
open System.Text.RegularExpressions

let mydate = "19/02/2009"

/// Function to transform a date given as "dd/mm/yyyy" to the one required by the NBA webpage as "yyyymmdd"
let dateTransform date = date |> String.split ['/'] |> List.rev |> List.reduce_left (+)

/// Transform the given date to the NBA webpage format
let dateNBA = mydate |> dateTransform

/// Compose exact URL
let myUrl = "http://www.nba.com/games/" + dateNBA + "/scoreboard.html"

/// Get the contents of the URL via a web request
let http (url: string) =
    let req = System.Net.WebRequest.Create(url)
    let resp = req.GetResponse()
    let stream = resp.GetResponseStream()
    let reader = new StreamReader(stream)
    let html = reader.ReadToEnd()
    resp.Close()
    html

//The actual HTML content
let webpage = myUrl |> http

//Extraction of matches for the date
let linksFromHtml webpage =
    let regex = "http://www.nba.com/games/\\d{8}/([^/]+)/gameinfo.html"
    Regex.Matches(webpage, regex)
    |> Seq.cast &lt; Match &gt;
    |> Seq.map (fun m -> m.Groups.[1].Value, m.Value)
    |> List.of_seq

The part that's failing is the last one, which you kindly tried to help me with. I've gone through the entire chapter on regular expressions in the Expert F# book but haven't been able to fix that last function.
As it is now, the compiler claims that Regex.Matches(webpage, regex) should have type 'unit but has type 'bool.

I don't know where the error is, as MSDN says that Regex.Matches "Searches an input string for all occurrences of a regular expression and returns all the successful matches".
That should return "The MatchCollection of Match objects found by the search", as they say, which really doesn't help me much.

Any hint on this?

Thank you for all the effort you put into helping people with their first steps into this :).

Cheers!

By on 2/23/2009 6:49 AM ()

Strange, the exact same code works fine for me. What version of F# and the .NET Framework are you using?

Also, for each of the following two definitions:

let test = Regex.Matches("test", "test") |> Seq.cast<Match>

let test2 = Regex.Matches("test", "test")

What type does VS tell you that test and test2 are? It looks to me like, for some reason, it thinks the function has a different type than it actually does.

Oh and one more thing, I assume the &lt; &gt; are in your code because you didn't wrap it in code tags and it translated your symbols. But just in case you actually copied it out of brilsmurf's post and didn't replace them with the correct symbols, they should actually be the less than and greater than symbols, like I put in mine above. Hopefully that's not too obvious lol, but I'm not sure why else you'd get such a strange error.

Edit: OK I noticed something very weird. If I replace the less than and greater than symbols in my code (that works) with &lt; and &gt; and hover over the final variable that I assigned the result to, it says that its type is bool, even though it gives me a syntax error (albeit a different syntax error than the one you are getting). If I change my working code slightly from this:

let result =
    Regex.Matches("test", "test")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value, m.Value)
    |> List.of_seq

to this:

let result =
    Regex.Matches("test", "test")
    |> Seq.cast <Match>
    |> Seq.map (fun m -> m.Groups.[1].Value, m.Value)
    |> List.of_seq

it still fails but now it tells me that result has type 'unit'. Note that the only difference between these two code fragments is the presence of a single whitespace character between Seq.cast and the type argument. So, try it verbatim as in the version I verified works (the one without the whitespace character) and see what happens.
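For anyone who wants to sidestep the angle brackets after Seq.cast altogether, here is a minimal sketch of two equivalent spellings. It assumes a current F# compiler, where List.of_seq is spelled List.ofSeq; the names viaTypeArg and viaAnnotation are only illustrative.

```fsharp
open System.Text.RegularExpressions

// Explicit type argument: no space between Seq.cast and <Match>.
let viaTypeArg =
    Regex.Matches("input", "[aeiouy]")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value)
    |> List.ofSeq

// Alternative: leave Seq.cast generic and annotate the lambda argument,
// so the element type is inferred from the next pipeline stage.
let viaAnnotation =
    Regex.Matches("input", "[aeiouy]")
    |> Seq.cast
    |> Seq.map (fun (m: Match) -> m.Value)
    |> List.ofSeq
```

Both bindings should evaluate to the same list of vowel matches, so the second form can be handy when pasting code into a forum that mangles angle brackets.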

By on 2/23/2009 7:09 AM ()

Fortunately or unfortunately for you, I am that dumb... and I did put &lt and &gt instead of their corresponding symbols. As these are my first steps, I just thought they had a purpose I didn't understand, as I haven't nearly finished the book and this is my first program in F#... sorry.
Also, sorry about the code tags... I didn't see any icon for them and thought they weren't allowed. I'll use them from now on.

On the matter... it all seems to be correct now... there's no syntactic mistake, but I still can't get the list of matches. It might be a problem with the regular expression or something, as I get an empty list.

If anyone wants to try, the code right now is like this:

#light

open System.IO
open System.Net
open System.Text.RegularExpressions

let mydate = "23/02/2009"  

/// Function to transform a date given as "dd/mm/yyyy" to the one required by the NBA webpage as "yyyymmdd"
let dateTransform date = date |> String.split ['/'] |> List.rev |> List.reduce_left (+)

///  Transform the given date to the NBA webpage format
let dateNBA = mydate |> dateTransform

/// Compose exact URL

let myUrl = "http://www.nba.com/games/" + dateNBA + "/scoreboard.html"

/// Get the contents of the URL via a web request
let http (url: string) =
    let req = System.Net.WebRequest.Create(url)
    let resp = req.GetResponse()
    let stream = resp.GetResponseStream()
    let reader = new StreamReader(stream)
    let html = reader.ReadToEnd()
    resp.Close()
    html

//The actual HTML content
let webpage = myUrl |> http
let re = "http://www.nba.com/games/\\d{8}/([^/]+)/gameinfo.html"
let result =
    Regex.Matches(webpage, re)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value, m.Value)
    |> List.of_seq
  
let games = result 
    
let showGames games = 
   games |> List.iter (fun x -> printf "%s" x)

Thank you for all the help! Appreciate it.

Cheers!

By on 2/23/2009 10:21 AM ()

Heh, no problem. Just for reference, the code tags need a language attribute. Like this:

<code lang=fsharp>
..code goes here
</code>

As for the regex, it's been a while since I've done anything with regexes, but the \\d might be the problem. \d is a digit already, I'm not sure what the extra slash would do. What if you change it to:

let re = "http://www.nba.com/games/[^/]*/[^/]*/gameinfo.html"

If that works, then you know the problem is with the digit specification.
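On the \\d question, a small sketch may help. In a regular F# string literal a doubled backslash is just an escaped backslash, so the regex engine should receive \d either way. The sample string below is hypothetical and only there for illustration.

```fsharp
open System.Text.RegularExpressions

// In a regular string literal, "\\" is an escaped backslash,
// so this literal contains the text games/\d{8}/
let escaped = "games/\\d{8}/"

// A verbatim string (@"...") passes backslashes through unchanged.
let verbatim = @"games/\d{8}/"

// Both literals denote the same pattern text.
let samePattern = (escaped = verbatim)

// Hypothetical sample input, only for illustration.
let sample = "games/20090223/DALBOS/gameinfo.html"
let matchesEscaped = Regex.IsMatch(sample, escaped)
```

If samePattern is true, the doubled backslash is unlikely to be the cause of an empty result, and writing the pattern as a verbatim string is mostly a readability choice.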

By on 2/23/2009 1:48 PM ()

I don't know what I'm doing wrong but it seems I just can't get this right hehe!

With the regex you suggested the result is the same. Even if I do it with something like this:

let re = "http://www.nba.com/games/*"

the final list is empty... I guess the problem is elsewhere... :).
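When even a very loose pattern returns nothing, one way to narrow things down is to check the downloaded text directly before blaming the regex. A rough sketch, where page is a hypothetical stand-in for the real downloaded HTML (which is not shown in this thread) and List.ofSeq is the current spelling of List.of_seq:

```fsharp
open System.Text.RegularExpressions

// Hypothetical stand-in for the downloaded page.
let page = "<a href=\"games/20090223/DALBOS/gameinfo.html\">Box score</a>"

// Step 1: is the literal prefix present in the text at all?
let hasPrefix = page.Contains("http://www.nba.com/games/")

// Step 2: does a much shorter pattern find anything?
let looseMatches =
    Regex.Matches(page, "games/")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value)
    |> List.ofSeq
```

If hasPrefix is false while looseMatches is non-empty, the page simply does not contain the text the pattern starts with, and the regex syntax is probably fine.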

Thanks for everything!

By on 2/23/2009 2:38 PM ()

Ahh, I see. For starters, you need to use "captures". By default a match has exactly one group, and that is the entire matched string, in this case the URL. If I run the code with the URL given in the beginning, I get the list

[("", "[link:www.nba.com]

as the result. If you want to extract the date and the teams, you need to mark the parts you want to capture with parentheses in the regular expression.

let re = "http://www.nba.com/games/([^/]*)/([^/]*)/gameinfo.html"

This should save the values 20090125 and DALBOS. I don't remember how you access them, whether it's through "Groups" or "Captures". I remember it was unintuitive the last time I had to do it, but hopefully that gets you on the right track.
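For reference, a minimal sketch of reading the parenthesised parts back through Groups. The sample URL is assembled from the values mentioned above and is an assumption, not a link taken from the real page.

```fsharp
open System.Text.RegularExpressions

let re = "http://www.nba.com/games/([^/]*)/([^/]*)/gameinfo.html"

// Sample URL built from the values mentioned above (20090125, DALBOS).
let url = "http://www.nba.com/games/20090125/DALBOS/gameinfo.html"

let m = Regex.Match(url, re)

// Groups.[0] is the whole match; numbered groups start at 1,
// counted by the position of their opening parenthesis.
let whole = m.Groups.[0].Value
let date  = m.Groups.[1].Value
let teams = m.Groups.[2].Value
```

Captures, by contrast, matters mainly when a single group is repeated by a quantifier and can match several times within one match.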

By on 2/23/2009 2:59 PM ()

That's definitely helpful.

Just yesterday I had a deep look at Groups, etc in the regex and some more code other than the book's on how to manage them.

I'll try again when I get home tonight.

Thanks for all the interest you've put in this.

Cheers!

By on 2/24/2009 12:38 AM ()

I finally did it... at last!

It all comes down to a decision by nba.com to change the way they write the links, hehe!

Before, they were written as "www.nba.com/games/20090302/NOHPHI/gameinfo.html", while now each link has been reduced to "games/20090302/NOHPHI/gameinfo.html". That's why I was getting an empty result from the regex match.
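One way to guard against this kind of change is to make the host part optional in the pattern. This is only a sketch: the extract function and the two sample snippets are hypothetical, and List.ofSeq is the current spelling of List.of_seq.

```fsharp
open System.Text.RegularExpressions

// The scheme and host are wrapped in optional non-capturing groups,
// so absolute and relative links both match.
let re = @"(?:(?:http://)?www\.nba\.com/)?games/(\d{8})/([^/]+)/gameinfo\.html"

let extract (html: string) =
    Regex.Matches(html, re)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value, m.Groups.[2].Value)
    |> List.ofSeq

// Hypothetical snippets in the old and new link styles.
let oldStyle = "<a href=\"http://www.nba.com/games/20090302/NOHPHI/gameinfo.html\">"
let newStyle = "<a href=\"games/20090302/NOHPHI/gameinfo.html\">"
```

With this pattern, extract oldStyle and extract newStyle should return the same date and team pair, because the capturing groups sit after the optional prefix.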

Code now looks like this

#light

open System.IO
open System.Net
open System.Text.RegularExpressions

let mydate = "02/03/2009"  

/// Function to transform a date given as "dd/mm/yyyy" to the one required by the NBA webpage as "yyyymmdd"
let dateTransform date = date |> String.split ['/'] |> List.rev |> List.reduce_left (+)

///  Transform the given date to the NBA webpage format
let dateNBA = mydate |> dateTransform

/// Compose exact URL

let myUrl = "http://www.nba.com/games/" + dateNBA + "/scoreboard.html"

/// Get the contents of the URL via a web request
let http (url: string) =
    let req = System.Net.WebRequest.Create(url)
    let resp = req.GetResponse()
    let stream = resp.GetResponseStream()
    let reader = new StreamReader(stream)
    let html = reader.ReadToEnd()
    resp.Close()
    html

//The actual HTML content
let webpage = myUrl |> http

/// Regular expression to match the date and teams involved
/// We take as Group "date" the numbers after the first slash "/"
/// We take as Group "teams" the rest until the next slash "/"
let re = @"games/(?<date>[0-9]+)/(?<teams>\w+)"

let result =
    Regex.Matches(webpage, re)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.Item("date").Value, m.Groups.Item("teams").Value, m.Value)
    |> List.of_seq

> result;;
val it : (string * string * string) list
= [("20090302", "NOHPHI", "games/20090302/NOHPHI");
("20090302", "ATLWAS", "games/20090302/ATLWAS");
("20090302", "CLEMIA", "games/20090302/CLEMIA");
("20090302", "DALOKC", "games/20090302/DALOKC");
("20090302", "SASLAC", "games/20090302/SASLAC")]

It could be better... I suppose. For a start, I think it would be good to separate the teams' initials directly in the regex... but I will leave it as is until I get my copy of "Mastering Regular Expressions" ;).
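Splitting the team initials inside the regex could look roughly like the sketch below. It assumes every team code is exactly three uppercase letters, which holds for the sample output above but is not guaranteed by anything in this thread. The group names first and second and the parse function are made up for illustration.

```fsharp
open System.Text.RegularExpressions

// Assumption: each team code is exactly three uppercase letters.
let re = @"games/(?<date>[0-9]{8})/(?<first>[A-Z]{3})(?<second>[A-Z]{3})"

let parse (html: string) =
    Regex.Matches(html, re)
    |> Seq.cast<Match>
    |> Seq.map (fun m ->
        // Small helper to read a named group from the current match.
        let g (name: string) = m.Groups.[name].Value
        g "date", g "first", g "second")
    |> List.ofSeq

// One of the links from the output above.
let sample = "games/20090302/NOHPHI/gameinfo.html"
```

Calling parse sample should give a single tuple with the date and the two three-letter codes. Which of the two teams is home and which is away is not something the thread establishes.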

Thanks, thanks, thanks a lot for all the help and interest put in solving my problem.

Cheers!

By on 3/3/2009 6:36 AM ()

Could it be related to this? According to MSDN, MatchCollection "Represents the set of successful matches found by iteratively applying a regular expression pattern to the input string."
[SerializableAttribute]
public class MatchCollection : ICollection,
IEnumerable
I'm new to .NET and F#, so I don't really know if there's a problem treating an ICollection and IEnumerable object as a sequence. Where can I look up this kind of thing so that I don't bother you all more than necessary?

Thanks!

By on 2/23/2009 6:57 AM ()
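On the ICollection / IEnumerable question above: the declaration quoted from MSDN lists only the non-generic IEnumerable, whose elements are typed as obj, and that is why the pipelines in this thread go through Seq.cast before Seq.map. A minimal sketch, assuming a current F# compiler (List.ofSeq instead of List.of_seq) and illustrative binding names:

```fsharp
open System.Collections
open System.Text.RegularExpressions

let matches = Regex.Matches("input", "[aeiouy]")

// MatchCollection viewed as the non-generic IEnumerable (elements are obj).
let untyped = matches :> IEnumerable

// Seq.cast<Match> wraps it as seq<Match>, so Seq.map sees Match values.
let typed : seq<Match> = untyped |> Seq.cast<Match>

let values = typed |> Seq.map (fun m -> m.Value) |> List.ofSeq
```

The same pattern applies to other older .NET collections that only expose the non-generic interfaces.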