#9 ✓resolved
Wildgoose

Detail pages are broken with relative urls

Reported by Wildgoose | December 11th, 2008 @ 08:20 PM

When scraping pages with urls like: http://somewhere.com/base/mypage

and wanting detail pages for urls like: href="somewhere_else.htm"

Then in mechanize.rb/handle_relative_url we resolve this (incorrectly) to: http://somewhere.com/somewhere_e...

The erroneous code is:

        case resolve
          when :full
            @@current_doc_url = (@@host_name + doc_url) if ( @@host_name != nil && (doc_url !~ /#{@@host_name}/))
            @@current_doc_url = @@current_doc_url.split('/').uniq.join('/')

Clearly we need to use the base url from the previous page, not just the hostname. I am not sure how to fix this though. I'm following the code back from detail_page_filter.rb and the whole use of @@host_name just looks wrong? Shouldn't we be keying off the URL history instead and just dropping all this @@host_name and @@original_host_name stuff?

I'd be grateful for suggestions on how to fix this.
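To make the expected behaviour concrete, here is a minimal sketch using Ruby's standard `URI.join`, which resolves a link against the full URL of the page it was found on. This is not scRUBYt! code; the variable names are made up for illustration and the URLs are the ones from the report above. The last lines show a second problem with the quoted snippet: the `split('/').uniq.join('/')` step silently drops any path segment that appears twice.

```ruby
require "uri"

# Example values taken from the report above.
page_url = "http://somewhere.com/base/mypage"
href     = "somewhere_else.htm"

# Resolve the link against the whole page URL, not just the host name.
detail_url = URI.join(page_url, href).to_s
puts detail_url  # http://somewhere.com/base/somewhere_else.htm

# Root-relative links resolve against the site root.
root_relative = URI.join(page_url, "/other/page.htm").to_s
puts root_relative  # http://somewhere.com/other/page.htm

# Side effect of the split/uniq/join step in the quoted code:
# a repeated path segment ("a") is removed from the URL.
collapsed = "http://somewhere.com/a/b/a/page".split("/").uniq.join("/")
puts collapsed  # http://somewhere.com/a/b/page
```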

Comments and changes to this ticket

  • Peter Szinek

    Peter Szinek December 15th, 2008 @ 10:37 AM

    • Assigned user set to “Peter Szinek”
    • State changed from “new” to “open”

    Well yeah, this is the worst part of the whole scRUBYt! codebase for sure: very bad initial decisions followed by layers and layers of hacks to fix the problem introduced in step n-1. I was a total Ruby beginner when I wrote it (I guess this is obvious from the quality of the code :) and could never muster enough courage to refactor it since then... There are some workarounds, though. Check out this part:

    
                case resolve
                  when :full
                    @@current_doc_url = (@@host_name + doc_url) if ( @@host_name != nil && (doc_url !~ /#{@@host_name}/))
                    @@current_doc_url = @@current_doc_url.split('/').uniq.join('/')
                  when :host
                    base_host_name = (@@host_name.count("/") == 2 ? @@host_name : @@host_name.scan(/(http.+?\/\/.+?)\//)[0][0])
                    @@current_doc_url = base_host_name + doc_url
                  else
                    # custom resolving
                    @@current_doc_url = resolve + doc_url
                end
    

    In 99% of cases you want :resolve => :host; sometimes :resolve => 'http://my.stuff' if really nothing else helps. In that case the relative URL is resolved against http://my.stuff rather than against what scRUBYt! (falsely) thinks it should be resolved against. I hope this helps - if it doesn't, just drop me a mail or update the ticket.

    This stuff is working nicely in the 'from scratch' branch (skimr - http://github.com/scrubber/scrub...). It is even possible to do some funky stuff there like parallel crawling to detail pages etc., so unless someone is going to pay me $1000/hour for rewriting this in the old branch, I'll concentrate my efforts on moving the good code to the skimr branch rather than the other way around...
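As a rough illustration of what the three resolve modes are meant to achieve, here is a standalone sketch built on the standard `URI.join`. The helper `resolve_detail_url` is hypothetical and not part of scRUBYt!; in particular, its `:full` branch resolves against the page URL, which is the behaviour the ticket asks for rather than what the snippet above does.

```ruby
require "uri"

# Hypothetical helper (not scRUBYt! API) mirroring the three resolve modes.
def resolve_detail_url(page_url, href, resolve = :full)
  base =
    case resolve
    when :full then page_url                      # whole URL of the current page
    when :host then URI.join(page_url, "/").to_s  # scheme + host only
    else resolve                                  # custom base URL given as a string
    end
  URI.join(base, href).to_s
end

page = "http://somewhere.com/base/mypage"

puts resolve_detail_url(page, "somewhere_else.htm", :full)
# http://somewhere.com/base/somewhere_else.htm
puts resolve_detail_url(page, "somewhere_else.htm", :host)
# http://somewhere.com/somewhere_else.htm
puts resolve_detail_url(page, "somewhere_else.htm", "http://my.stuff")
# http://my.stuff/somewhere_else.htm
```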

  • Wildgoose

    Wildgoose December 15th, 2008 @ 10:51 AM

    Aha, I didn't know about a new branch. Can you comment on its fitness for purpose right now? Is it finished enough that people like me can have a go at using it? Perhaps update the main trunk readme to mention it in the meantime?

    Perhaps you could offer some writeup on the new architecture so that folks could help and contribute to the new branch? I agree that if there is a rewrite in progress then the old branch should be abandoned. I did actually use the resolve method you suggested to complete my scrape; it was a good workaround to have available.

    More widely - I have an idea to allow people to enter user-generated scrapers into a project I am building. It's nothing to do with TV, but you could imagine it being the kind of project like xmltv where people can contribute their own scrapers to grab certain kinds of data (actually xmltv is probably an excellent source of new tests for scrubyt...). Scrubyt seems like an attractive starting point because it's a fairly limited parser and on the surface doesn't seem to offer the user the chance of arbitrary code execution.

    Where I find scrubyt a bit limited (and it may well be out of scope) is in building the final output in EXACTLY the right fashion. You seem to be able to get close in many cases, but where you need an exact XML output (or whatever), it seems likely that you almost need to parse the output a second time to re-order some stuff and perhaps jig things around. Do you think a limited meta language to handle that final output step might be within scope of scrubyt (or is it perhaps already implemented by some other project)?

  • Peter Szinek

    Peter Szinek December 15th, 2008 @ 11:08 AM

    • State changed from “open” to “resolved”

    Re: skimr: Are you subscribed to the blog? http://scrubyt.org/blog/ I mentioned it there, though it's a good idea to mention it in the readme.

    Due to other responsibilities I couldn't contribute to the skimr branch yet, so your best shot is to ask the guy who did it - Glenn Gillen (http://www.rubypond.com/, glenn dot gillen at [nooooooospham] gmail). I know he is using it commercially so it can't be that bad, but you should ask him. He told me he is going to write some blog posts on the topic, so I am sure if you nudge him it will motivate him to write them asap ;-)

    re: more widely - I got a contract with the goal of doing just that from January, so stay tuned ;) re: arbitrary code execution - you know about the script pattern, right?

    re: limited - I think moulding the result is not out of scope at all, just missing at the moment ;) For now you can use to_hash and to_flat_hash and shape the result with Ruby - but I am working on all kinds of modifications in this area too.

    Thanks for suggestions / feedback, keep them coming!

  • Josh Wand

    Josh Wand February 10th, 2009 @ 06:07 PM

    I can't get this to work with next_page:

    http://pastie.org/385121

    Even specifying the base url explicitly doesn't seem to have any effect.

    I've tried this with both the rubyforge (0.4.4) and github (0.4.11) versions.

  • Peter Szinek

    Peter Szinek February 11th, 2009 @ 08:25 AM

    Next/detail pages are correctly working (and much more effective) in the skimr branch. Unfortunately it'd be close to impossible to backport it from there - the architecture is so different that it would just make no sense (and as I wrote for the other ticket, there is just no time to do such big changes to the old branch).

    I would probably do what I suggested in another thread on the ML:

    "...break down the scraper into more scrapers - not ideal, but does the job. So you could create a scraper which generates all the links, save them to a yaml file and create a second scraper which iterates on these links"
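The two-scraper workaround quoted above could look roughly like the sketch below. It only shows the hand-off through a YAML file; the link list and the per-page step are placeholders standing in for real scrapers, and the file name is an assumption.

```ruby
require "yaml"
require "tmpdir"

# Stage 1: the first scraper would collect the detail-page links.
# These URLs are placeholders for scraped results.
links = [
  "http://somewhere.com/base/detail_1.htm",
  "http://somewhere.com/base/detail_2.htm"
]

# Save the links so the second scraper can run independently.
links_file = File.join(Dir.tmpdir, "detail_links.yml")
File.write(links_file, links.to_yaml)

# Stage 2: the second scraper loads the saved links and visits each in turn.
visited = YAML.load_file(links_file).map do |url|
  # Replace this with the real per-page scraper; here we only record the URL.
  "would scrape #{url}"
end

puts visited
```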

    Sorry, there is just too much going on to fix everything (in the old branch I mean - the skimr branch has to be bullet proof), esp. complex problems like this one...
