What is Chaining Scrapers?
Many websites show a listing of products, with each listing linking to a product detail page that holds more information.
To collect the details for every product, you can configure two scrapers that work together. We call these chained scrapers.
The first scraper (the parent scraper) collects a list of URLs (links) to the product pages. The second scraper (the child scraper) uses that output as its input and collects data from each individual product page.
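The parent/child relationship described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the product implements it: the `SITE` dictionary stands in for real web pages, and the names `parent_scraper`, `child_scraper`, and `run_chain`, along with the example.com URLs, are hypothetical.

```python
# Hypothetical in-memory "site": one listing page and two detail pages.
# In a real chain these would be live web pages.
SITE = {
    "https://example.com/products": {
        "links": ["https://example.com/p/1", "https://example.com/p/2"],
    },
    "https://example.com/p/1": {"title": "Phone A", "price": "9999"},
    "https://example.com/p/2": {"title": "Phone B", "price": "12999"},
}


def parent_scraper(listing_url):
    """Return one row per listing, with the detail link in a 'url' column."""
    return [{"url": link} for link in SITE[listing_url]["links"]]


def child_scraper(detail_url):
    """Return the data fields for a single product detail page."""
    page = SITE[detail_url]
    return {"url": detail_url, "title": page["title"], "price": page["price"]}


def run_chain(listing_url):
    """Feed every URL the parent extracted into the child."""
    parent_rows = parent_scraper(listing_url)
    return [child_scraper(row["url"]) for row in parent_rows]


results = run_chain("https://example.com/products")
for row in results:
    print(row)
```

The point to notice is that the child never needs to know about the listing page. It only receives the URL column produced by the parent.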
How to Extract URLs with Chained Scrapers
A good example is the following Flipkart listing page:
https://www.flipkart.com/mobiles/mi~brand/pr?sid=tyy,4io&otracker=nmenu_sub_Electronics_0_Mi
This page shows a list of MI mobiles.
Each listing on the page links to a detail page that has more information about that mobile.
To get all of the details for each MI Mobile, you can create a listings scraper that extracts data from
https://www.flipkart.com/mobiles/mi~brand/pr?sid=tyy,4io&otracker=nmenu_sub_Electronics_0_Mi
During the configuration, you need to select the list and capture all of the links in one of the columns.
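To show what "capturing all of the links in one of the columns" amounts to, here is a rough sketch using only the Python standard library. The HTML snippet, the base URL, and the names `LinkCollector` and `extract_link_column` are made up for illustration. In the scraper itself you select the list visually rather than writing code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Listing pages often use relative links; make them absolute.
                self.links.append(urljoin(self.base_url, href))


# Hypothetical listing markup, not taken from any real page.
SAMPLE_HTML = """
<div class="listing"><a href="/item-one/p/abc">Item One</a></div>
<div class="listing"><a href="/item-two/p/def">Item Two</a></div>
"""


def extract_link_column(html, base_url):
    """Return one row per listing with the link stored in a 'url' column."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [{"url": link} for link in collector.links]


rows = extract_link_column(SAMPLE_HTML, "https://www.example.com/")
for row in rows:
    print(row["url"])
```

Storing absolute URLs in the column matters, because the child scraper will later open each value directly.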
Once the listings scraper (the parent scraper) is created, you can create a details scraper (the child scraper) against one of the product detail pages, like https://www.flipkart.com/redmi-8-sapphire-blue-64-gb/p/itme9614ba9b9bda
In this scraper, you can configure all the data fields you want, such as price, title, and ratings.
After configuring and saving the details scraper, you can set it to use the URLs extracted by your listings scraper.
On the ‘Manage Input’ tab of your details scraper, change the ‘Input Source’ dropdown to ‘URLs from another Scraper’, set the ‘Parent Scraper’ to your listings scraper, and select the URL column that holds the extracted URLs.
You can also enable the option ‘Always run the parent first - run this when parent finishes’.
This option automatically runs the parent scraper before the child scraper, irrespective of which one has a run triggered.
When you enable it, the child scraper's schedule is set by the parent scraper, and the child scraper is triggered each time the parent scraper finishes a run.
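The run order this option produces can be modelled with a small sketch. The `Chain` class and its methods are hypothetical and exist only to show the ordering. The behaviour with the option turned off (the child reusing whatever URLs the parent last extracted) is an assumption made for the sake of the example.

```python
class Chain:
    """Toy model of a parent/child pair and the run-parent-first option."""

    def __init__(self, always_run_parent_first):
        self.always_run_parent_first = always_run_parent_first
        self.parent_urls = []  # URLs from the parent's most recent run
        self.run_log = []      # order in which scrapers actually ran

    def _run_parent(self):
        self.run_log.append("parent")
        # Placeholder URLs standing in for the parent's extracted column.
        self.parent_urls = ["https://example.com/p/1", "https://example.com/p/2"]

    def _run_child(self):
        self.run_log.append("child")
        return [{"url": url} for url in self.parent_urls]

    def trigger(self, which):
        """Trigger a run of 'parent' or 'child'."""
        if self.always_run_parent_first:
            # Whichever scraper was triggered, the parent runs first
            # and the child runs once the parent has finished.
            self._run_parent()
            return self._run_child()
        if which == "parent":
            self._run_parent()
            return []
        # Assumption: without the option, the child uses the last known URLs.
        return self._run_child()


chain = Chain(always_run_parent_first=True)
rows = chain.trigger("child")
print(chain.run_log)
```

Triggering either scraper on this model records the parent before the child, which mirrors the behaviour described above.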
Chained scrapers can have multiple levels, such as a products scraper that is chained to a listings scraper that is chained to a categories scraper.
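A multi-level chain follows the same pattern, repeated once per level. The sketch below assumes a hypothetical three-level site (categories, listings, products) held in a dictionary. The function names and example.com URLs are illustrative only.

```python
# Hypothetical site: a categories page links to listing pages,
# which link to product pages.
SITE = {
    "https://example.com/categories": [
        "https://example.com/c/phones",
        "https://example.com/c/laptops",
    ],
    "https://example.com/c/phones": [
        "https://example.com/p/1",
        "https://example.com/p/2",
    ],
    "https://example.com/c/laptops": ["https://example.com/p/3"],
    "https://example.com/p/1": {"title": "Phone A"},
    "https://example.com/p/2": {"title": "Phone B"},
    "https://example.com/p/3": {"title": "Laptop C"},
}


def link_scraper(url):
    """Categories or listings scraper: return the links found on a page."""
    return list(SITE[url])


def product_scraper(url):
    """Products scraper: return the data fields for one product page."""
    return {"url": url, **SITE[url]}


def run_levels(start_urls, link_levels):
    """Run link-collecting scrapers level by level, then the products scraper."""
    urls = start_urls
    for _ in range(link_levels):
        # Each level's output URLs become the next level's input.
        urls = [link for url in urls for link in link_scraper(url)]
    return [product_scraper(url) for url in urls]


products = run_levels(["https://example.com/categories"], link_levels=2)
for product in products:
    print(product)
```

Each level only depends on the URL column of the level above it, so adding another level means adding one more scraper to the chain rather than changing the existing ones.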