#22 new

Detail pages return duplicate data.

Reported by sgray | June 8th, 2009 @ 02:20 PM

I'm copying a post I made on the Google Groups blog that describes the issue.

You are having the same problem I reported in "Detail page problems".
There appears to be a bug in detail page processing. I was able to
make my examples work by changing detail_page_filter.rb. The change I
made is a kludge at best; however, it does highlight the issue.

Here's what I changed in detail_page_filter.rb.

  #if @detail_extractor.nil?
  #  @detail_extractor = Extractor.new @parent_pattern.extractor.mode, @parent_pattern.referenced_extractor
  #  root_results = @detail_extractor.result
  #  root_results = @detail_extractor.evaluate_extractor

  @detail_extractor = Extractor.new @parent_pattern.extractor.mode, @parent_pattern.referenced_extractor
  root_results = @detail_extractor.result

This creates a new @detail_extractor for every detail page instead of
reusing the existing one. The underlying problem is in the
evaluate_extractor method: when called, it returns the previous
results. I'm not sure why, but evaluate_extractor in extractor.rb has
a couple of issues when processing detail pages that I don't know how
to resolve.

  catch :quit_next_page_loop do
    loop do
      url = get_current_doc_url #TODO need absolute address here


      @processed_pages << url
      @root_patterns.each do |root_pattern|



The line "url = get_current_doc_url" always returns nil. The line
"@root_results.push(*root_pattern.evaluate(get_hpricot_doc, nil))" is
not executed. It falls out of the loop without doing anything and
returns the previous results.
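
To illustrate the failure mode, here is a minimal sketch of how a reused object can hand back stale results when its evaluation step exits early. ToyExtractor and its methods are hypothetical stand-ins I made up for this example; they are not the actual Extractor class in extractor.rb, and the nil check only mimics what I think happens when the URL lookup returns nil.

```ruby
# Hypothetical stand-in for an extractor; NOT scRUBYt!'s real Extractor.
class ToyExtractor
  attr_reader :result

  def initialize(doc_url)
    @doc_url = doc_url
    @result = []
    evaluate
  end

  # Re-evaluates against a URL. When the URL is nil the method returns
  # early, so @result still holds whatever the previous run stored.
  def evaluate(url = @doc_url)
    return @result if url.nil? # mimics falling out of the loop early
    @result = ["data from #{url}"]
  end
end

# Reusing one instance: the second call gets the first page's data back.
cached = ToyExtractor.new("page1")
first  = cached.result.dup
second = cached.evaluate(nil)

# Creating a new instance per page (the kludge) avoids the stale state.
fresh = ToyExtractor.new("page2").result
```

In this sketch the reused instance yields the same data twice, while the fresh instance yields data for its own page. That would explain why constructing a new extractor each time hides the bug without fixing evaluate_extractor itself.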


No comments found
