Detail pages return duplicate data.
Reported by sgray | June 8th, 2009 @ 02:20 PM
I'm copying a post I made on the Google Groups blog that describes the issue.
You are having the same problem I reported in "Detail page problems".
There appears to be a bug in detail page processing. I was able to make my examples work by changing detail_page_filter.rb. The change I made is a kludge at best; however, it does highlight the issue.
Here's what I changed in detail_page_filter.rb.
#if @detail_extractor.nil?
# @detail_extractor = Extractor.new @parent_pattern.extractor.mode, @parent_pattern.referenced_extractor
# root_results = @detail_extractor.result
#else
# root_results = @detail_extractor.evaluate_extractor
#end
@detail_extractor = Extractor.new @parent_pattern.extractor.mode, @parent_pattern.referenced_extractor
root_results = @detail_extractor.result
This creates a new @detail_extractor each time. There are problems in the evaluate_extractor method: when called, it returns the previous results. I'm not sure why, but the method evaluate_extractor in extractor.rb has a couple of issues when processing detail pages that I don't know how to resolve.
catch :quit_next_page_loop do
loop do
url = get_current_doc_url #TODO need absolute address here
@processed_pages << url
@root_patterns.each do |root_pattern|
@root_results.push(*root_pattern.evaluate(get_hpricot_doc, nil))
end
The line "url = get_current_doc_url" always returns nil. The line "@root_results.push(*root_pattern.evaluate(get_hpricot_doc, nil))" is not executed. It falls out of the loop without doing anything and returns the previous results.
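The failure mode can be reproduced in isolation with a small sketch. This is not scRUBYt code: the StaleExtractor class, its evaluate method, and the example.com URLs are hypothetical stand-ins, and the sketch only assumes that a reused extractor keeps its old results when the loop exits before processing a page.

```ruby
# Hypothetical stand-in for an extractor that keeps its results between calls.
# None of these names come from scRUBYt; they only illustrate the failure mode.
class StaleExtractor
  def initialize(pages)
    @pages = pages     # queue of [url, data] pairs still to be processed
    @results = []
  end

  # When no current URL is available, the loop exits immediately and the
  # results left over from the previous call are returned unchanged.
  def evaluate
    loop do
      url, data = @pages.shift
      break if url.nil?      # nothing to process: fall out of the loop
      @results = [data]      # only reached when a page was available
    end
    @results
  end
end

# Reusing one extractor: the second call has no page and returns old data.
reused = StaleExtractor.new([["http://example.com/1", "detail 1"]])
first  = reused.evaluate
second = reused.evaluate

# Creating a new extractor per detail page (the kludge) avoids the stale data.
fresh = StaleExtractor.new([["http://example.com/2", "detail 2"]]).evaluate
```

In this sketch, second holds the same data as first, which is the duplicate-data symptom; fresh holds the data for the new page because nothing is carried over.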
Scott
A simple to learn and use, yet powerful web scraping toolkit written in Ruby.