How to Scrape the Web using R

A workshop and associated R package

Abstract

Web scraping is a powerful tool for mining large amounts of data. Though Python has traditionally been the preferred language for scraping, R offers an assortment of competitive packages for tasks ranging from simple scrapes to crawling the web. In this talk, I teach the essentials of web scraping via a custom-built R package, how2scrape, available from my GitHub.

Date
Event
UMD Government and Politics Methods Workshop Fall 2018
Location
Dept. of Government and Politics, UMD

The entire workshop is bundled into my R package, how2scrape. Below are the instructions for how to install it and follow along with the workshop. I plan to update the package in the future, extending coverage to include crawling with RSelenium. There’s a possibility I’ll create a similar tutorial for crawling with Python too. Any updates will be posted to my blog.

Instructions for the workshop:

We will be conducting the web scraping workshop in R. To be able to follow along and get the most out of the session, some preparation is in order.

First, download and install the SelectorGadget CSS selector plug-in for your browser if you do not already have it. This nifty tool is a tremendous aid when web scraping, especially for beginners who are not familiar with JavaScript, HTML, and XML/XPath.
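To give a sense of why the selector plug-in matters, here is a minimal sketch of how a CSS selector is typically used once you have one. It assumes the rvest package (version 1.0.0 or later; older versions use html_nodes() in place of html_elements()), which may or may not be what the workshop itself uses. The HTML snippet, the `li.bill a` selector, and the bill names are made up for illustration and parsed from a string so that no network access is needed.

```r
# Assumption: rvest is installed (install.packages('rvest'))
library(rvest)

# A made-up page fragment standing in for a real website
html <- '<html><body>
  <ul>
    <li class="bill"><a href="/bill/1">H.R. 1</a></li>
    <li class="bill"><a href="/bill/2">H.R. 2</a></li>
  </ul>
</body></html>'

page <- read_html(html)

# "li.bill a" is the kind of selector the plug-in would suggest
# after clicking on one of the links in the browser
bill_links <- html_elements(page, "li.bill a")

titles <- html_text(bill_links)          # visible link text
urls   <- html_attr(bill_links, "href")  # value of the href attribute

print(data.frame(title = titles, url = urls))
```

The same pattern applies to a live page: swap the string for a URL in read_html() and replace the selector with the one the plug-in reports.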

Second, I have bundled the entire workshop into an R package. Please install it from my GitHub repository. To do so, run the following code in R:

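# Install devtools only if it is not already available
if (!requireNamespace('devtools', quietly = TRUE)) install.packages('devtools')
# Build the vignettes during installation; the lab depends on them
devtools::install_github('EandrewJones/how2scrape', build_vignettes = TRUE)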

Note: You must include build_vignettes = TRUE, otherwise the lab will not work. Also, the package only works in R version >= 3.5.1, so please make sure your version of R is updated. If you are still using an outdated version, think of this as a friendly nudge.
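If you are unsure which version of R you are running, a quick check along these lines may help before you install. This is only a sketch using base R; the variable names are my own, and the 3.5.1 threshold is taken from the note above.

```r
# Minimum R version stated in the note above
required_version <- "3.5.1"

# getRversion() returns the running R version in a form that can be compared
current_version <- as.character(getRversion())
version_ok <- getRversion() >= required_version

if (version_ok) {
  message("R ", current_version, " meets the requirement (>= ", required_version, ").")
} else {
  message("R ", current_version, " is older than ", required_version,
          "; please update R before installing the package.")
}
```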

The package can take a fairly long time to download and compile. Don’t worry, this is normal. The package scrapes roughly 10,000 bills from Congress as it compiles, so please be patient. If you run into any issues, please leave me a message via email or on GitHub.
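Once the installation finishes, something like the following should confirm that the vignettes were built and let you open them. It is a sketch using only base R functions (vignette() and browseVignettes() from utils); the vignette names are not listed here because they depend on what the package actually ships.

```r
pkg <- "how2scrape"

# TRUE only if the package installed successfully
pkg_installed <- requireNamespace(pkg, quietly = TRUE)

if (pkg_installed) {
  # List whatever vignettes were built during installation
  available <- vignette(package = pkg)$results[, c("Item", "Title"), drop = FALSE]
  print(available)

  # Open the vignette index in a browser when running interactively
  if (interactive()) browseVignettes(pkg)
} else {
  message("Package '", pkg, "' is not installed yet; run the install step first.")
}
```

If the printed list is empty, the most likely cause is that build_vignettes = TRUE was left out of the install call.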

Evan Jones
Software Engineer & Cat Dad