Web scraping by watching requests

22 octobre 2021 · 5 min read

Web scraping is a technique used to extract content from websites in order to archive data in a structured way. The data is extracted directly from the source code. The work consists in searching it in a jungle of HTML tags.

It is however much easier to retrieve this data in a structured format. This is possible by intercepting web requests between your browser and the website.

This method works, among others, on social networks or classifieds websites (but make sure to respect their terms of use).

Introduction

Websites contain a lot of Javascript. This language is used in particular to refresh the data in the browser without reloading the whole page. In this situation the web page sends an HTTP request to the web server to get the new data to be injected into the page. These are generally transmitted in JSON, a structured data format.

Let’s see how to intercept requests with Playwright and Electron.

Prerequisites

Before getting to the heart of the matter, it is necessary to identify the request you wish to intercept :

Open your browser and load the website you want to work on. If necessary, log in to the site with your credentials.

Open Developer Tools (key F12). Click on Networks. All HTTP requests between the browser and the website are listed. Reload the page or browse the site to see the activity. Among them is the one you are interested in. The goal is to find its name.

For example on a classified ads site that I won’t name, the search request receives a JSON containing the data to be displayed on the web page.

Request

In the Headers tab, the Request URL information gives you the full path to the URL where we can use a regular expression.

Playwright

Playwright is a tool developed in open-source by Microsoft. It allows to automate tests on the different internet browsers.

First install NodeJS and Visual Studio Code on your computer.

Create a folder and open a command prompt. Then type :

npm init

Press ENTER to validate the default options.

Let’s add Playwright to the project, as a development dependency:

npm i -D playwright

Open the folder with Visual Studio Code (or your favorite editor):

code .

Create the main.js file next to the package.json file:

const { chromium } = require('playwright')
async function main () {
  const browser = await chromium.launch({
    headless: false
  });
  const context = await browser.newContext();  
  const page = await context.newPage();
  page.on('response', (response) => {
    if (!/jsonsearch/.test(response.url())) { return }
    console.log(response.status(), response.url())
    response.json().then((json) => {
      console.log(json) // do what you have to do with this JSON
    }).catch((error) => {
      console.error(error)
    })
  });
  await page.goto('https://en.jeffprod.com');
}
main()

Explanation of the above code:

  • The search engine of this site jeffprod.com uses a JSON file. In this example we intercept the request that retrieves the index of the blog posts.
  • if (!/jsonsearch/.test(response.url())) { return } : We want to intercept the requests whose URL contains jsonsearch. Because as mentioned in the prerequisites paragraph, we have identified it beforehand. Here, we ignore URLs that do not contain this term via a regular expression.
  • console.log(response.status(), response.url()): we display, for information, the HTTP response code (normally 200) and the intercepted URL.
  • console.log(json): the intercepted JSON is available in this variable. For the rest, you are the only one who knows what to do with this data.

To test the program, in package.json add the following line in the scripts section:

"scripts": {
  "start": "node main.js"
}

You can then start the program with the command :

npm run start

Electron

Electron allows the creation of desktop applications for Windows, Mac and Linux with HTML, CSS and NodeJS. The application is a web browser based on Chromium.

Follow the official tutorial to create the basis of your application.

We will intercept the requests with the debugger API. Here is a modified example of the main.js file. The part related to the topic of this article is indicated between the comments Debugger API :

const { app, BrowserWindow } = require('electron')

let mainWin
let requestIDs = new Set()

function createWindow () {
  mainWin = new BrowserWindow({
    width: 800,
    height: 600
  })

  // -----------------------------------------------------------------------
  // Debugger API

  const getResponse = (requestId) => {
    mainWin.webContents.debugger.sendCommand('Network.getResponseBody', { requestId })
    .then(function(response) {
      let jsonresp
      try {
        jsonresp = JSON.parse(response.body)
      } catch (err) {
        console.error('invalid JSON', err)
        return
      }
      console.log(jsonresp) // do what you have to do with this JSON
    }).catch((err) => {
      console.error(err)
    })
  }

  mainWin.webContents.debugger.on('detach', (event, reason) => {
    console.log('Debugger detached due to : ', reason)
  })

  mainWin.webContents.debugger.on('message', (event, method, params) => {
    if (method === 'Network.requestWillBeSent' && /jsonsearch/.test(params.request.url)) {
      requestIDs.add(params.requestId)
    }
    if (method === 'Network.loadingFinished' && requestIDs.has(params.requestId)) {
      console.log('got loadingFinished for an interesting request')
      requestIDs.delete(params.requestId)
      getResponse(params.requestId)
    }
  })

  mainWin.webContents.debugger.attach('1.3')
  mainWin.webContents.debugger.sendCommand('Network.enable')
  mainWin.loadURL('https://en.jeffprod.com')

  // end debugger API
  // -----------------------------------------------------------------------
} // function createWindow

app.whenReady().then(() => {
  createWindow()
  app.on('activate', () => {
    if (BrowserWindow.getAllWindows().length === 0) {
      createWindow()
    }
  })
})

app.on('window-all-closed', () => {
  if (process.platform !== 'darwin') {
    app.quit()
  }
})

Replace /jsonsearch/ with the regular expression of the URL you want to intercept.

The technique is as follows:

  • if (method === 'Network.requestWillBeSent' && /jsonsearch/.test(params.request.url)) : if a request is going to be issued and we are interested in it (the URL contains jsonsearch), we save the ID of this request in the array variable requestIDs.
  • if (method === 'Network.loadingFinished' && requestIDs.has(params.requestId)): for any request whose state is finished, if its ID is in our save array, we get the request body via the getResponse function.

References

Chrome Network Domain


RELATED ARTICLES