Parallelize execution of RSelenium

Usage

parscrape(
  scrape_fun,
  scrape_input,
  cores = NULL,
  packages = c("base"),
  browser,
  ports = NULL,
  chunk_size = NULL,
  scrape_tries = 1,
  proxy = NULL,
  extraCapabilities = list()
)

Arguments

scrape_fun

a function of one argument, x, that sends instructions to remDr (the remote driver); this is the scraping function to be parallelized

scrape_input

a data frame, list, or vector where each element is an input to be passed to scrape_fun

cores

number of cores to run RSelenium instances on. Defaults to available cores - 1.

packages

a character vector with package names of packages used in scrape_fun

browser

a character vector specifying the browser to be used

ports

vector of ports for RSelenium instances. If left at the default NULL, parscrape will randomly generate ports.

chunk_size

number of scrape_input elements to be processed per round of scrape_fun. parscrape splits scrape_input into chunks and runs scrape_fun in multiple rounds to avoid losing data due to errors. Defaults to the number of cores.

scrape_tries

number of times parscrape will retry scraping a chunk when it encounters an error

proxy

a proxy setting function that runs before scraping each chunk

extraCapabilities

a list of extraCapabilities options to be passed to rsDriver
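
The chunk_size and scrape_tries arguments describe chunked execution with retries. The following is a minimal base-R sketch of that idea, not parsel's internal code. The helper try_chunk and the stand-in toy_fun are hypothetical names introduced for illustration, and the sketch assumes that tries counts total attempts, which the argument description does not settle. No browser is involved.

```r
# Illustrative sketch only; this is not parsel's internal code.

# Split the input into chunks of at most `chunk_size` elements.
scrape_input <- c("a", "b", "c", "d", "e")
chunk_size <- 2
chunk_id <- ceiling(seq_along(scrape_input) / chunk_size)  # 1 1 2 2 3
chunks <- split(scrape_input, chunk_id)

# Hypothetical helper: apply `f` to every element of a chunk, making up to
# `tries` attempts. Returns the list of results, or the last error condition
# if every attempt failed.
try_chunk <- function(f, chunk, tries = 1) {
  result <- NULL
  for (attempt in seq_len(tries)) {
    result <- tryCatch(lapply(chunk, f), error = function(e) e)
    if (!inherits(result, "error")) break
  }
  result
}

# Toy stand-in for a scraping function: fails on "c", succeeds otherwise.
toy_fun <- function(x) {
  if (x == "c") stop("could not scrape ", x)
  toupper(x)
}

# Process each chunk independently, so one failing chunk does not discard
# the results of the others.
results <- lapply(chunks, try_chunk, f = toy_fun, tries = 2)

# Flag the chunks whose result is an error condition.
failed <- vapply(results, inherits, logical(1), what = "error")
```

Here only the second chunk (containing "c") is flagged in failed, while the results of the first and third chunks are kept. A smaller chunk size therefore limits how much input is affected by a single error, at the cost of more rounds.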

Value

a list with two elements, scraped_results and not_scraped. scraped_results is a list containing the output of scrape_fun. If every input element was scraped, not_scraped is NULL. Otherwise, not_scraped is a data.frame containing the scrape_input id, chunk id, and associated error of each unscraped input element.
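
A short sketch of how the returned list might be inspected. The object below is a hand-built mock with the two elements described above, not actual parscrape output. Its values are placeholders, and the exact nesting of scraped_results is an assumption.

```r
# Mock object standing in for the return value of parscrape().
# The values are placeholders, not real scraping output.
parsel_out <- list(
  scraped_results = list(list("result 1"), list("result 2")),
  not_scraped = NULL
)

# not_scraped is NULL when every input element was scraped; otherwise it is
# a data.frame describing the inputs that failed.
if (is.null(parsel_out$not_scraped)) {
  message("All inputs were scraped.")
} else {
  print(parsel_out$not_scraped)
}

# Number of results returned in the mock object.
n_results <- length(parsel_out$scraped_results)
```

Checking not_scraped before using scraped_results makes it easy to decide whether the failed inputs need another run.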

Examples

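if (FALSE) {
# Two CSS selectors to scrape; each element is passed to scrape_fun as x
input <- c(".central-textlogo__image", ".central-textlogo__image")

# Scraping function: x is a single element of scrape_input,
# and remDr is the remote driver that receives the instructions
scrape_fun <- function(x) {
  input_i <- x
  remDr$navigate("https://www.wikipedia.org/")
  element <- remDr$findElement(using = "css", input_i)
  element <- element$getElementText()
  return(element)
}

# Run scrape_fun over input on 2 cores, using headless Firefox
parsel_out <- parscrape(scrape_fun = scrape_fun,
                        scrape_input = input,
                        cores = 2,
                        packages = c("RSelenium"),
                        browser = "firefox",
                        scrape_tries = 1,
                        chunk_size = 2,
                        extraCapabilities = list(
                          "moz:firefoxOptions" = list(args = list("--headless"))
                        )
                       )
}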