Top 8 NodeJs Web Scraping Libraries


In this post, we are going to talk about the tools and libraries available for web scraping in the Node.js ecosystem. We will start with some easy, basic libraries and then move on to more advanced tools.

We will briefly cover the pros and cons of each tool and try to highlight the small details of each one that can help while scraping.

Web Scraping with NodeJS: Let's Get Started

Web scraping is an important skill for any developer to have. It allows you to automatically gather data from websites and store it for later use. This tutorial will show you how to get started with web scraping using NodeJS.

Libraries

  1. Simplecrawler

  2. Cheerio

  3. Puppeteer

  4. Playwright

  5. Axios

  6. Unirest

  7. Superagent

  8. Nightmare

Along the way, we will also point out the most important things to keep in mind during data extraction.

Simplecrawler

Simplecrawler is designed to provide a basic, flexible, and robust API for crawling websites. It was written to archive, analyze, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue. It has a flexible queue system that can be frozen to disk and defrosted.

Example

To understand Simplecrawler, we are going to scrape books.toscrape.com. I am assuming that you have Node.js installed and a working directory where we will save our script. So, the first thing is to install Simplecrawler.

npm install --save simplecrawler

I have created a scraper.js file in my folder. Inside that file, write the following:

var Crawler = require("simplecrawler");
var crawler = new Crawler("https://books.toscrape.com/");

We supply the constructor with a URL that indicates which domain to crawl and which resource to fetch first. You can configure 3 things before scraping the website.

a) Request Interval

crawler.interval = 10000;

b) Concurrency of requests

crawler.maxConcurrency = 3;

c) Maximum crawl depth

crawler.maxDepth = 1;

//Or:

crawler.maxDepth = 2;

This library also provides more configuration properties, which can be found in its documentation.

You’ll also need to set up event listeners for the events you want to listen to. crawler.fetchcomplete and crawler.complete are good places to start.

crawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
  console.log("I just received %s (%d bytes)", queueItem.url, responseBuffer.length);
  console.log("It was a resource of type %s", response.headers['content-type']);
});
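
The complete event works the same way. Here is a minimal sketch (the log message is just an assumption of what you might want to do when the crawl finishes) that fires once the queue has been exhausted:

crawler.on("complete", function() {
  // The queue is empty and every fetch has finished.
  console.log("Crawl finished!");
});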

Then, when you’re satisfied and ready to go, start the crawler! It’ll run through its queue, finding linked resources on the domain to download until it can’t find any more.

crawler.start();

Pros

  1. Adjusts headers and respects robots.txt.

  2. Lots of customization properties are available.

  3. Easy setup using event listeners.

Cons

  1. The biggest disadvantage is that it does not support Promises.

  2. Error handling is limited.

  3. It will also try to fetch invalid URLs due to its brute-force approach.

Cheerio

Cheerio is a library that is used to parse HTML and XML documents. You can use jQuery syntax with the downloaded data. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

You can filter out the data you want using selectors. Cheerio works with a very simple, consistent DOM model. As a result, parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

Example

We will scrape the header line from books.toscrape.com, which says “Books to Scrape”.

First, you have to install cheerio

npm install cheerio

Then type the following code to extract the desired text.

const cheerio = require('cheerio');
const axios = require('axios');

async function main() {
    var scraped_data = await axios.get("https://books.toscrape.com");

    const $ = cheerio.load(scraped_data.data);
    var name = $(".page_inner").first().find("a").text();
    console.log(name);

    //Books to Scrape
}

main();

First, we make an HTTP request to the website and store the response in scraped_data. We then load it into Cheerio and use the class name to extract the data.
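
To go a step further, Cheerio makes it just as easy to pull out a list of elements. Here is a minimal sketch that could sit inside the same main() function; the article.product_pod h3 a selector is an assumption about the markup on books.toscrape.com, used to collect every book title on the first page:

// Collect every book title on the page (the selector is an assumption about the markup).
var titles = [];
$("article.product_pod h3 a").each(function (i, el) {
    titles.push($(el).attr("title"));
});
console.log(titles);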

Pros

  1. Data parsing & extraction becomes very easy.

  2. Already configured methods are available.

  3. API is fast.

Cons

  1. It cannot execute JavaScript, so it only sees the static HTML returned by the server.

Puppeteer

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol.

Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium, which lets you watch the execution live.

It removes the dependency on any external driver to run the operation. Puppeteer provides better control over Chrome.

Example

We are going to scrape books.toscrape.com. First, install the Puppeteer library.

npm i puppeteer --save

Then in your scraper.js file write the following code.

const puppeteer = require('puppeteer');

async function main() {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    var results = await page.goto('https://books.toscrape.com/');
    await page.waitForTimeout(1000);
    await browser.close();
    console.log(results);
}

main();

Awesome! Let’s break it down line by line:

First, we create our browser and set headless mode to false. This allows us to watch exactly what is going on:

const browser = await puppeteer.launch({headless: false});

Then, we create a new page in our browser:

const page = await browser.newPage();

Next, we go to the books.toscrape.com URL:

var results = await page.goto('https://books.toscrape.com/');

I’ve added a delay of 1000 milliseconds. While normally not necessary, this will ensure everything on the page loads:

await page.waitForTimeout(1000);

Finally, after everything is done, we’ll close the browser and print our results.

await browser.close();
console.log(results);

The setup is complete. The data is ready!
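
Note that results here is the response object returned by the navigation rather than the page HTML itself. If you want the rendered markup, a minimal sketch (run before browser.close()) would be:

// Grab the fully rendered HTML of the page.
const html = await page.content();
console.log(html);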

Pros

  • Puppeteer gives access to the loading and rendering time measurements provided by the Chrome performance analysis tools.

  • Puppeteer removes the dependency on an external driver to run the tests.

Cons

  • Puppeteer is limited to the Chrome browser for now, until Firefox support is completed.

  • Puppeteer currently has a smaller testing community; there is more test-specific support for Selenium.

Playwright

Playwright is a Node.js library to automate Chromium, Firefox, and WebKit with a single API very similar to Puppeteer. Playwright is built to enable cross-browser web automation that is evergreen, capable, reliable, and fast, covering everything from automating tasks and testing web applications to data mining.

Example

We will build a simple scraper to demonstrate how Playwright works. We will scrape the first book's product page on books.toscrape.com.

Now we’ll install Playwright.

npm i playwright

Building a scraper

Creating a scraper with Playwright is surprisingly easy, even if you have no previous scraping experience. If you understand JavaScript and CSS, it will be a piece of cake.

In your project folder, create a file called scraper.js (or choose any other name) and open it in your favorite code editor. First, we will confirm that Playwright is correctly installed and working by running a simple script.

const playwright = require('playwright');
async function main() {
    const browser = await playwright.chromium.launch({headless: false});
    const page = await browser.newPage();
    var results = await page.goto('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
    await page.waitForTimeout(10000);
    await browser.close();
}

main();

If you saw a Chromium window open and the book page load successfully, congratulations, you just robotized your web browser with Playwright!

The results variable holds the response returned by page.goto(). You can read the rendered HTML (for example with await page.content() before closing the browser) and then use Cheerio to extract the information you need.

Clicking buttons is extremely easy with Playwright. By prefixing text= to a string you’re looking for, Playwright will find the element that includes this string and click it. It will also wait for the element to appear if it’s not rendered on the page yet.
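
For example, here is a minimal sketch of clicking a button by its text; the button label “Add to basket” is an assumption about the page you are on:

// Clicks the first element containing the text "Add to basket".
// Playwright waits for it to appear before clicking.
await page.click('text=Add to basket');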

This is a huge advantage over Puppeteer. Once you have clicked, you wait for the page to load and then use Cheerio to get the information you are looking for. But for now, we are not going in that direction.

Pros

  1. Clicking buttons is way easier than Puppeteer.

  2. Cross-browser support.

  3. Documentation is great

Cons

  1. They have not patched the actual rendering engine.

HTTP Clients

An HTTP Client can be used to send requests to a server and retrieve their responses. We will discuss 3 libraries that are simply used to make an HTTP request to the server or the web page which you are trying to scrape.

Axios

Axios is a promise-based HTTP client for both the browser and Node.js. It will provide us with the complete HTML code of the target website. Making a request using Axios is quite simple and straightforward.

var axios = require('axios');

async function main() {
    try {
        var scraped_data = await axios.get("https://books.toscrape.com/");
        console.log(scraped_data.data);
        //<!DOCTYPE html>......//
    } catch (err) {
        console.log(err);
    }
}

main();

You can install Axios through the following command

npm i axios --save

Pros

  1. It has interceptors to modify requests (see the sketch after this list).

  2. Supports promises.

  3. Error handling is great.
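
To illustrate the interceptor feature mentioned above, here is a minimal sketch, assuming you simply want to attach a custom User-Agent header to every outgoing request (the header value is just a placeholder):

var axios = require('axios');

// Request interceptor: runs before every request is sent.
axios.interceptors.request.use(function (config) {
    config.headers['User-Agent'] = 'my-scraper/1.0'; // placeholder value
    return config;
});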

Unirest

Unirest is a set of lightweight HTTP libraries available in multiple languages, built and maintained by Kong, who also maintain the open-source API Gateway Kong.

Using Unirest is similar to how we use Axios. You can use it as an alternative for Axios.

var unirest = require('unirest');

async function main() {
    try {
        var scraped_data = await unirest.get("https://books.toscrape.com/");
        console.log(scraped_data.body);
        //<!DOCTYPE html>......//
    } catch (err) {
        console.log(err);
    }
}

main();

You can install Unirest through the following command

npm i unirest --save

Pros

  1. Auto support for gzip

  2. File transfer is simple.

  3. You can send the request directly by providing a callback along with the URL (see the example below).
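
For example, a minimal callback-style request could look like this, using Unirest's .end() method to pass the callback:

var unirest = require('unirest');

// The callback receives the response object once the request completes.
unirest.get("https://books.toscrape.com/").end(function (response) {
    console.log(response.body);
});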

Superagent

Superagent is a small, progressive client-side HTTP request library, with a Node.js module exposing the same API, that supports many high-level HTTP client features. Its API is similar to Axios and supports promises and async/await syntax.

const superagent = require('superagent');

async function main() {
    try {
        var scraped_data = await superagent.get("https://books.toscrape.com/");
        console.log(scraped_data.text);
        //<!DOCTYPE html>......//
    } catch (err) {
        console.log(err);
    }
}

main();

You can install Superagent through the following command

npm i superagent --save

Pros

  • Multiple functions can be chained to build and send requests (see the sketch after this list).

  • Numerous plugins available for many common features

  • Works in both Browser and Node
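
To show that chaining, here is a minimal sketch; the header and query values are just placeholders:

const superagent = require('superagent');

async function fetchPage() {
    // Chain .set() and .query() onto the request before awaiting it.
    const res = await superagent
        .get("https://books.toscrape.com/")
        .set("User-Agent", "my-scraper/1.0")
        .query({ page: 1 });
    console.log(res.status);
}

fetchPage();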

Cons

  1. Its API does not adhere to any standard.

Nightmare

Nightmare is a high-level browser automation library from Segment. It uses Electron (the same Chrome-derived framework that powers the Atom text editor), which is similar to PhantomJS but about twice as fast and a bit more modern. It was originally designed for automating tasks across sites that don't have APIs but is most often used for UI testing and crawling.

Nightmare is an ideal choice over Puppeteer if you don't like the heavy bundle it comes with. Your scraper will bypass a lot of the annoying code that can trip it up. It also means websites that render mostly on the client side are still scrape-able; if you have ever been thrown by needing to make an AJAX request to return a form while scraping, today is your day to be awesome!

You can install nightmare library by running the following command:

npm install nightmare

Once Nightmare is installed we will find Scrapingdog’s website link through the Duckduckgo search engine.

const Nightmare = require('nightmare');
const nightmare = Nightmare();

nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'Scrapingdog')
  .click('#search_button_homepage')
  .wait('#links .result__a')
  .evaluate(() => document.querySelector('#links .result__a').href)
  .end()
  .then((link) => {
    console.log('Scrapingdog Web Link:', link);
  })
  .catch((error) => {
    console.error('Search failed:', error);
  });

Now, we’ll go line by line. First, we have created an instance of Nightmare. Then we’ll open the Duckduckgo search engine using .goto.

Then we select the search bar using its selector and change the value of the search box to “Scrapingdog” using .type. Once all this is done, we submit the search by clicking the search button. Nightmare waits until the first result link has loaded and then uses a DOM method to get the value of its href attribute. After receiving the link, it prints it to the console.

Pros

  1. It's a great fit for the ES2017 async/await syntax.

  2. Wait feature to pause the browser.

  3. Lighter than Puppeteer.

Cons

  1. Undiscovered vulnerabilities may exist in Electron that could allow a malicious website to execute code on your computer.

Is Nodejs Good for Web Scraping?

Node.js is a JavaScript runtime environment that is frequently used for web scraping. It has many features that make it a good choice for web scraping, such as its asynchronous nature and support for Promises.
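
As a quick illustration of that asynchronous nature, here is a minimal sketch (using Axios and a handful of catalogue pages from books.toscrape.com as example URLs) that fetches several pages concurrently with Promise.all:

const axios = require('axios');

// Example page URLs; adjust to whatever pages you actually need.
const urls = [
    'https://books.toscrape.com/catalogue/page-1.html',
    'https://books.toscrape.com/catalogue/page-2.html',
    'https://books.toscrape.com/catalogue/page-3.html'
];

async function fetchAll() {
    // All three requests run concurrently; Promise.all resolves once every one finishes.
    const responses = await Promise.all(urls.map((url) => axios.get(url)));
    responses.forEach((res) => console.log(res.config.url, res.data.length));
}

fetchAll().catch(console.error);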

Some drawbacks of using node.js for web scraping include the need to manage multiple dependencies and the potential for performance issues.

Final Takeaway

So, these were some open-source web scraping tools and libraries that you can use for web scraping projects. If you just want to focus on data collection, you can always use a web scraping API.

Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading, and please hit the like button!

Frequently Asked Questions

Is NodeJS good for web scraping?

Node.js is a good choice for extracting data from websites. Even though Python is still more popular for web scraping, Node.js does the job well and can handle large amounts of traffic more efficiently than Python in production.

Additional Resources

And there’s the list! At this point, you should feel comfortable writing your first web scraper to gather data from any website. Here are a few additional resources that you may find helpful during your web scraping journey: