How to Perform Web Scraping With Puppeteer

When employed correctly, web scraping helps your business extract large amounts of relevant data from competitor websites. The data can then be analyzed to make better marketing and financial decisions.

There are several ways to extract data, such as manual copy-and-paste, text grepping, DOM parsing, HTTP programming, and web scraping.

Of all the above methods, web scraping is the most widely used technique for data extraction. In this article, we will learn how you can perform web scraping effectively with Puppeteer. But first, let's cover some basics of Puppeteer and web scraping.

What is Puppeteer?

Puppeteer is a Node.js library maintained by Google that provides a high-level API to control headless Chrome over the DevTools Protocol. A headless browser is one without a UI, which makes it easy for a scraper to interact with. Here, Chrome is the puppet that can be manipulated to perform a set of custom tasks as per our needs.
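
To get a feel for the API, here is a minimal sketch that launches headless Chrome, navigates to a page, and prints its title (the URL is just an example):

**code begins**

const puppeteer = require('puppeteer');

(async () => {
    // Launch a headless Chrome instance
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Navigate to a page and read its title
    await page.goto('https://example.com');
    console.log(await page.title());

    await browser.close();
})();

**code ends**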

Puppeteer can be used to perform the following sample tasks:

● Generate PDFs and screenshots of different pages (see the sketch after this list).
● Crawl a SPA and generate pre-rendered content (i.e., “SSR”).
● Automate signing in to sites like Facebook.
● Scrape links and content from different websites.
● Emulate Googlebot’s Web Rendering Service (WRS) for SEO.
● Automate UI testing and form submission.
● Create an automated testing environment that runs your tests directly in the latest version of Chrome, using the latest JavaScript and browser features.
● Capture a timeline trace to help diagnose performance issues.
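
As an example of the first task, here is a minimal sketch that captures both a screenshot and a PDF of a page (the output filenames are arbitrary; note that page.pdf() only works in headless mode):

**code begins**

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');

    // Save a full-page screenshot and an A4 PDF of the same page
    await page.screenshot({ path: 'page.png', fullPage: true });
    await page.pdf({ path: 'page.pdf', format: 'A4' });

    await browser.close();
})();

**code ends**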

What is web scraping?

Web scraping is a data extraction technique in which automated bots or crawlers scrape content from different websites for later analysis. Web scraping involves two fundamental processes:

● Fetching – downloading the page via crawling software.
● Extraction – parsing the fetched page and copying its data into a spreadsheet or database for further processing.
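
Both processes map directly onto Puppeteer calls. A minimal sketch of the two phases (the URL and the h1 selector are placeholders):

**code begins**

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Fetching: download the page
    await page.goto('https://example.com');

    // Extraction: parse the DOM and copy out the data
    const headings = await page.$$eval('h1', els => els.map(el => el.textContent));
    console.log(headings);

    await browser.close();
})();

**code ends**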

Automated web scraping is essential for your business because:

● It lets you scrape competitor product information like reviews and pricing.
● You can get quality contact lists by scraping information from online niche directories.
● It lets you analyze the latest industry trends before you launch a product.
● You can optimize the pricing and descriptions of your products or services by performing an in-depth analysis of data extracted from e-commerce websites.
● It lets you detect investment risks so that you can make accurate business decisions.
● It saves time and effort because web scraping using a bot can be done faster than manual scraping.

How to Use Puppeteer API for Web Scraping?

Here are the steps to follow:

Step 1: Install Node.js, an asynchronous, event-driven JavaScript runtime.

Step 2: Create a folder and open it in your command prompt using the following commands:

$ mkdir scraper-demo
$ cd scraper-demo

Step 3: Run npm init -y to create a package.json file for managing project dependencies.

Step 4: Now, install Puppeteer (which also downloads a compatible version of Chromium) using the following command:

$ npm install --save puppeteer

Step 5: Now, open package.json in your code editor and add a start entry to the scripts section:

**code begins**

{
  . . .
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "start": "node index.js"
  },
  . . .
  "dependencies": {
    "puppeteer": "^5.2.1"
  }
}

**code ends**
Save the changes and close the editor.

Step 6: In this step, we will create four .js files, namely browser.js, index.js, pageController.js, and pageScraper.js.

Open browser.js and add the following code in it:

**code begins**

const puppeteer = require('puppeteer');

async function startBrowser(){
    let browser;
    try {
        console.log("Opening the browser......");
        // Launch a browser instance; headless: false shows the UI while you debug
        browser = await puppeteer.launch({
            headless: false,
            args: ["--disable-setuid-sandbox"],
            ignoreHTTPSErrors: true
        });
    } catch (err) {
        console.log("There was an error creating a browser instance => : ", err);
    }
    return browser;
}

module.exports = {
    startBrowser
};

**code ends**

Open index.js and add the following code in it:

**code begins**

const browserObject = require('./browser');
const scraperController = require('./pageController');

// Start the browser and hand the (promised) instance to the controller
let browserInstance = browserObject.startBrowser();

scraperController(browserInstance);

**code ends**

Open pageController.js and add the following code in it:

**code begins**

const pageScraper = require('./pageScraper');

async function scrapeAll(browserInstance){
    let browser;
    try{
        // Wait for the browser instance, then pass it to the page scraper
        browser = await browserInstance;
        await pageScraper.scraper(browser);
    }
    catch(err){
        console.log("There was an error creating a browser instance => ", err);
    }
}

module.exports = (browserInstance) => scrapeAll(browserInstance);

**code ends**

Open pageScraper.js and add the following code in it:

**code begins**

const scraperObject = {
    url: 'http://books.toscrape.com',
    async scraper(browser){
        let page = await browser.newPage();
        console.log(`Navigating to ${this.url}...`);
        await page.goto(this.url);
    }
}

module.exports = scraperObject;

**code ends**

Save and close all your files.

Step 7: Now, we will begin the scraping process. Open the pageScraper.js file and replace its code with the following. Replace the placeholder URL with the URL of the website you want to scrape; the selectors below assume the page lays out its contents in list tags, as books.toscrape.com does.

**code begins**

const scraperObject = {
    url: 'http://urltoscrape.com',
    async scraper(browser){
        let page = await browser.newPage();
        console.log(`Navigating to ${this.url}...`);
        await page.goto(this.url);
        // Wait for the required DOM to be rendered
        await page.waitForSelector('.page_inner');
        // Get the link of each book on the page
        let urls = await page.$$eval('section ol > li', links => {
            // Keep only books that are in stock
            links = links.filter(link => link.querySelector('.instock.availability').textContent.includes("In stock"));
            links = links.map(el => el.querySelector('h3 > a').href);
            return links;
        });

        // Open each URL in a new tab and scrape its details
        let pagePromise = (link) => new Promise(async(resolve, reject) => {
            let dataObj = {};
            let newPage = await browser.newPage();
            await newPage.goto(link);
            dataObj['bookTitle'] = await newPage.$eval('.product_main > h1', text => text.textContent);
            dataObj['bookPrice'] = await newPage.$eval('.price_color', text => text.textContent);
            dataObj['noAvailable'] = await newPage.$eval('.instock.availability', text => {
                // Strip newlines and tabs, then pull the stock count out of the parentheses
                text = text.textContent.replace(/(\r\n\t|\n|\r|\t)/gm, "");
                let regexp = /^.*\((.*)\).*$/i;
                let stockAvailable = regexp.exec(text)[1].split(' ')[0];
                return stockAvailable;
            });
            dataObj['imageUrl'] = await newPage.$eval('#product_gallery img', img => img.src);
            dataObj['bookDescription'] = await newPage.$eval('#product_description', div => div.nextSibling.nextSibling.textContent);
            dataObj['upc'] = await newPage.$eval('.table.table-striped > tbody > tr > td', table => table.textContent);
            // Close the tab before resolving with the scraped data
            await newPage.close();
            resolve(dataObj);
        });

        for (const link of urls) {
            let currentPageData = await pagePromise(link);
            // scrapedData.push(currentPageData);
            console.log(currentPageData);
        }
    }
}

module.exports = scraperObject;

**code ends**
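
To keep the results instead of only logging them, collect each dataObj in an array and write the array to disk with Node's fs module. A minimal sketch of that change (the scrapedData array and the books.json filename are our own additions, not part of the files above), replacing the for loop inside scraper():

**code begins**

const fs = require('fs');

// Collect each page's data instead of only logging it
let scrapedData = [];
for (const link of urls) {
    let currentPageData = await pagePromise(link);
    scrapedData.push(currentPageData);
}

// Write the collected data to a JSON file for further processing
fs.writeFile('books.json', JSON.stringify(scrapedData, null, 2), (err) => {
    if (err) throw err;
    console.log("Data saved to books.json");
});

**code ends**

Run the scraper with npm start (the start script we added in Step 5).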

That’s it! You just created a scraping application using Puppeteer.

Final Thoughts

Scraping is essential for businesses that want the upper hand over their competitors. Performing web scraping with Puppeteer is easy: create the files described above, add some basic code, and start scraping your competitors' data.