A workshop and associated R package
Web scraping is a powerful tool for mining large amounts of data. Though Python has traditionally been the preferred language for scraping, R offers an assortment of competitive packages for tasks ranging from simple scrapes to crawling the web. In this talk, I teach the essentials of web scraping via a custom-built R package, how2scrape, available from my GitHub.
The entire workshop is bundled into my R package, how2scrape. Below are the instructions for how to install it and follow along with the workshop. I plan to update the package in the future, extending coverage to include crawling with RSelenium. There’s a possibility I’ll create a similar tutorial for crawling with Python too. Any updates will be posted to my blog.
Instructions for the workshop:
We will be conducting the web scraping workshop in R. To be able to follow along and get the most out of the session, some preparation is in order.
Second, I have bundled the entire workshop into an R package. Please install it from my GitHub repository by running the following code in R:
install.packages('devtools')
devtools::install_github('EandrewJones/how2scrape', build_vignettes = TRUE)
Note: You must include build_vignettes = TRUE, otherwise the lab will not work. Also, the package only works in R version >= 3.5.1, so please make sure your version of R is updated. If you are still using an outdated version, think of this as a friendly nudge.
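If you are unsure whether your R installation meets the version requirement, a quick check along these lines may help before you install. This is only a minimal sketch: the helper name r_is_new_enough is hypothetical and not part of how2scrape, and the only fact taken from the instructions is the 3.5.1 minimum.

```r
# Minimum R version stated in the workshop instructions.
required_version <- "3.5.1"

# Hypothetical helper (not part of how2scrape): returns TRUE when the
# given R version is at least the required one.
r_is_new_enough <- function(current = getRversion(),
                            required = required_version) {
  current >= package_version(required)
}

# Report whether the running R session is ready for the install step.
if (r_is_new_enough()) {
  message("R ", as.character(getRversion()),
          " meets the requirement; ready to install.")
} else {
  message("R ", as.character(getRversion()), " is older than ",
          required_version, "; please update R first.")
}
```

Comparing version objects rather than raw strings avoids mistakes such as treating "3.10.0" as older than "3.5.1".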
The package can take a fairly long time to download and compile. Don’t worry, this is normal. The package scrapes ~10k bills from Congress as it compiles, so please be patient. If you run into any issues, please leave me a message via email or on GitHub.
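Once the install finishes, one way to confirm that the vignettes were actually built is to list them with base R tools. The sketch below assumes nothing about the vignette names inside how2scrape; the helper count_built_vignettes is hypothetical and simply wraps requireNamespace() and vignette().

```r
# Hypothetical helper (not part of how2scrape): returns the number of
# built vignettes for a package, or NA if the package is not installed.
count_built_vignettes <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NA_integer_)
  }
  # vignette(package = ...) returns a listing whose $results matrix has
  # one row per installed vignette.
  nrow(vignette(package = pkg)$results)
}

n_vignettes <- count_built_vignettes("how2scrape")

if (is.na(n_vignettes)) {
  message("how2scrape is not installed yet.")
} else if (n_vignettes == 0) {
  message("No vignettes found; try reinstalling with build_vignettes = TRUE.")
} else {
  message(n_vignettes, " vignette(s) found. Open them with browseVignettes('how2scrape').")
}
```

If the count is zero, the package was most likely installed without build_vignettes = TRUE, and rerunning the install command with that argument should fix it.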