Web data scraping with `rvest`

data

rvest

web-scraping

Learn how to extract valuable data from the web with ease and efficiency

Author

Indira Puteri

Affiliation

Student in the Mathematical Science program, Faculty of Science, Universiti Brunei Darussalam, and Statistics Lecturer at Universitas Islam Negeri Mataram, Indonesia

Published

February 5, 2024

In today’s data-centric era, accessing and analyzing online information has become crucial for businesses, researchers, and enthusiasts. However, manually retrieving data from websites is often tedious and inefficient. Enter web scraping: a method that automates data extraction from web pages.

Among the top choices for web scraping in R is the rvest package. Created by Hadley Wickham, rvest offers an intuitive interface for navigating web pages, selecting HTML elements, and extracting desired data seamlessly.

With rvest, users can scrape various types of web data, including text, tables, images, and more. Whether gathering product prices from online stores, collating news headlines from media sites, or extracting weather forecasts, rvest streamlines the process with its user-friendly functions.

To begin using rvest, install the package from CRAN and load it into your R session. From there, use read_html() to retrieve a page's HTML, html_nodes() to select specific elements via CSS selectors or XPath expressions, and html_text() to extract their text. Since rvest 1.0.0, html_nodes() has been superseded by html_elements(); both still work, but html_elements() is the recommended name for new code.
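The read, select, extract workflow described above can be sketched without touching a live site. This is a minimal illustration, not the author's code: the HTML fragment, its class name `title`, and the table contents are made up for the example, and `minimal_html()` stands in for `read_html()` on a real URL. It uses `html_elements()`, the current name for `html_nodes()`.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded page
page <- minimal_html('
  <h2 class="title"> First headline </h2>
  <h2 class="title">Second headline</h2>
  <table>
    <tr><th>city</th><th>temp</th></tr>
    <tr><td>Mataram</td><td>31</td></tr>
  </table>
')

# Select nodes with a CSS selector, then extract trimmed text
titles <- page %>%
  html_elements("h2.title") %>%
  html_text(trim = TRUE)

# The same selection expressed as an XPath expression
titles_xpath <- page %>%
  html_elements(xpath = "//h2[@class='title']") %>%
  html_text(trim = TRUE)

# Parse the first <table> on the page into a data frame
forecast <- page %>%
  html_element("table") %>%
  html_table()
```

Both selections should return the same character vector of headlines, and `forecast` holds the table as a data frame with one row per table row.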

Example 1

For example, consider scraping article titles from a news site such as detik.com. With rvest, you retrieve the page's HTML, identify the CSS selector that matches the article titles, and extract their text. The result is a character vector of headlines, ready for further analysis. The snippet below collects the titles using rvest and the %>% pipe, which rvest re-exports, so no other package needs to be loaded:

# install.packages("rvest")
library(rvest)

# Read and parse the page's HTML
detik_page <- read_html("https://news.detik.com/")

# Select the <h2> nodes that hold the news titles,
# then extract their text and trim surrounding whitespace
detik_page %>%
  html_nodes("h2.media__title") %>%
  html_text() %>%
  trimws()

[1] "Pertimbangan dan Alasan PDIP Pecat Jokowi, Ada soal MK hingga Langgar Etik"
[2] "Ada Sosok Pria Telanjang Dada Saat Anak Bos Toko Roti Ditangkap, Siapa?"   
[3] "Isi Surat Pemecatan Jokowi, Gibran Rakabuming dan Bobby dari PDIP"         
[4] "Surat Pemecatan Jokowi-Gibran-Bobby dari PDIP Diteken Langsung Megawati"   
[5] "PDIP Umumkan Pecat Jokowi, Gibran, dan Bobby Nasution!"                    
[6] "Anggota DPR Ingatkan Potensi Cuaca Ekstrem saat Libur Natal dan Tahun Baru"

Moreover, rvest integrates well with other R packages such as dplyr, ggplot2, and purrr, which handle data manipulation, visualization, and iteration. Whether you are building predictive models, conducting sentiment analysis, or crafting interactive dashboards, scraped data can flow directly into the rest of your data science workflow.
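As a small illustration of that hand-off, the sketch below passes scraped headlines to dplyr. It is an assumption-laden example rather than part of the original workflow: the three titles are copied from the Example 1 output instead of being scraped live, and the column name `n_words` and the "PDIP" filter are chosen purely for demonstration.

```r
library(dplyr)
library(tibble)

# Headlines copied from the Example 1 output (not fetched live)
headlines <- tibble(title = c(
  "Pertimbangan dan Alasan PDIP Pecat Jokowi, Ada soal MK hingga Langgar Etik",
  "Ada Sosok Pria Telanjang Dada Saat Anak Bos Toko Roti Ditangkap, Siapa?",
  "PDIP Umumkan Pecat Jokowi, Gibran, dan Bobby Nasution!"
))

# Count words per headline, keep those mentioning "PDIP", longest first
pdip_headlines <- headlines %>%
  mutate(n_words = lengths(strsplit(title, "\\s+"))) %>%
  filter(grepl("PDIP", title, fixed = TRUE)) %>%
  arrange(desc(n_words))
```

The same pattern applies to any character vector returned by `html_text()`: wrap it in a tibble, then filter, summarise, or plot it with the usual tidyverse tools.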

Example 2

The next script also scrapes news.detik.com with rvest, this time targeting <a> tags with the class media__link so that each headline is collected together with its URL. The nodes are selected with html_nodes(), and map_df() from the purrr package then iterates over them. For each node it builds a one-row data frame in which html_text() extracts the visible text of the link (the news title), trimws() strips surrounding whitespace, and html_attr("href") retrieves the corresponding URL.

The result is a tidy data frame, news_data, with one column for the cleaned titles and one for their URLs, ready for tasks such as content aggregation, monitoring, or analysis in digital journalism and other fields.

library(rvest)
library(purrr)
library(tibble)

# Read and parse the page's HTML
detik_page <- read_html("https://news.detik.com/")

# Build one row per <a class="media__link"> node: title text plus URL
news_data <- detik_page %>%
  html_nodes("a.media__link") %>%
  map_df(~ tibble(
    title = html_text(.x) %>% trimws(), # visible link text, whitespace trimmed
    link = html_attr(.x, "href")        # target URL from the href attribute
  ))

# View the first 20 links
head(news_data$link, 20)

 [1] "https://www.detik.com/sulsel/pilkada/d-7689404/pj-bupati-jeneponto-dalami-kasus-asn-dinsos-ubah-kis-warga-gegara-pilkada"    
 [2] "https://www.detik.com/sulsel/pilkada/d-7689404/pj-bupati-jeneponto-dalami-kasus-asn-dinsos-ubah-kis-warga-gegara-pilkada"    
 [3] "https://www.detik.com/sulsel/pilkada/d-7689404/pj-bupati-jeneponto-dalami-kasus-asn-dinsos-ubah-kis-warga-gegara-pilkada"    
 [4] "https://www.detik.com/sulsel/pilkada/d-7689396/dana-kampanye-paslon-pilwalkot-makassar-mulia-tertinggi-inimi-paling-sedikit" 
 [5] "https://www.detik.com/sulsel/pilkada/d-7689371/oknum-asn-dinsos-jeneponto-ubah-kis-warga-ancam-pecat-korban-dari-honorer"    
 [6] "https://news.detik.com/pilkada/d-7689315/hari-terakhir-rekapitulasi-suara-pilkada-di-2-provinsi-masih-belum-tuntas"          
 [7] "https://www.detik.com/sulsel/pilkada/d-7689128/dana-kampanye-appi-aliyah-rp-13-6-m-paling-tinggi-di-pilwalkot-makassar"      
 [8] "https://news.detik.com/pilkada/d-7688255/terungkap-komunikasi-pks-dengan-kubu-pramono-selepas-pilkada-jakarta"               
 [9] "https://news.detik.com/pilkada/d-7688217/gerindra-dki-belum-komunikasi-dengan-pramono-kami-kawal-program-warga"              
[10] "https://news.detik.com/pilkada/d-7688169/pdip-sambut-baik-komunikasi-pramono-dan-pks-hal-positif"                            
[11] "https://news.detik.com/pilkada/d-7688104/psi-dorong-pramono-kerja-sama-dengan-prabowo-atasi-masalah-di-jakarta"              
[12] "https://news.detik.com/pilkada/d-7688016/cak-lontong-jamin-pramono-bakal-rangkul-semua-pihak-termasuk-pks"                   
[13] "https://www.detik.com/sulsel/pilkada/d-7687096/pesan-trisal-ome-agar-gugatan-fkj-nur-bukan-soal-ijazah-palsu"                
[14] "https://www.detik.com/sulsel/pilkada/d-7687093/perkara-pilkada-bikin-asn-dinsos-jeneponto-ubah-kis-warga-jadi-meninggal"     
[15] "https://news.detik.com/pilkada/d-7686763/rano-akan-bagi-tugas-urus-jakarta-mas-pram-di-meja-saya-eksekutornya"               
[16] "https://www.detik.com/sulsel/pilkada/d-7686691/dinsos-jeneponto-akan-aktifkan-lagi-kis-warga-diubah-meninggal-gegara-pilkada"
[17] "https://www.detik.com/sulsel/pilkada/d-7686297/trisal-ome-ingatkan-fkj-nur-bedakan-sengketa-proses-dan-hasil-pilkada-palopo" 
[18] "https://www.detik.com/sulsel/pilkada/d-7686261/fkj-nur-gugat-hasil-pilkada-palopo-ke-mk-trisal-ome-sudah-siapkan-pembelaan"  
[19] "https://www.detik.com/sulsel/pilkada/d-7686388/viral-oknum-asn-dinsos-jeneponto-ubah-kis-warga-jadi-meninggal-gegara-pilkada"
[20] "https://www.detik.com/jogja/pilkada/d-7686378/polisi-limpahkan-berkas-kasus-politik-uang-di-minggir-sleman-ke-kejaksaan"
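The output above repeats some links, which suggests the same article is linked from several nodes on the page. Because html_text() and html_attr() are vectorised over a node set, the same table can also be built without iterating node by node. The following is a minimal sketch under stated assumptions: it uses a hypothetical inline HTML fragment with example.com URLs (via `minimal_html()`) in place of the live page, and it adds dplyr's `distinct()` to drop repeated links, a step that is not part of the original script.

```r
library(rvest)
library(tibble)
library(dplyr)

# Hypothetical fragment imitating the page structure, with one repeated link
page <- minimal_html('
  <a class="media__link" href="https://example.com/a"> Story A </a>
  <a class="media__link" href="https://example.com/a">Story A</a>
  <a class="media__link" href="https://example.com/b">Story B</a>
  <a class="other" href="https://example.com/c">Not a headline</a>
')

# Select all matching links once
links <- page %>% html_elements("a.media__link")

# html_text() and html_attr() each return one value per selected node
news_data <- tibble(
  title = html_text(links, trim = TRUE),
  link  = html_attr(links, "href")
) %>%
  distinct(link, .keep_all = TRUE) # keep the first row for each URL
```

Whether deduplication is appropriate depends on the analysis; if the number of times a story is linked matters, keep the repeated rows.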

In summary, rvest lets R users turn web pages into data they can analyze. A handful of functions is enough to read a page, select the relevant elements, and extract their text or attributes, replacing manual copying with a repeatable script. Whether you’re a novice or a seasoned data scientist, rvest is a valuable tool for extracting information from the web.

Citation

BibTeX citation:

@online{puteri2024,
  author = {Puteri, Indira},
  title = {Web Data Scraping with `Rvest`},
  date = {2024-02-05},
  url = {https://indiraputeri.github.io/posts/2024-03-29-post/},
  langid = {en}
}

For attribution, please cite this work as:

Puteri, Indira. 2024. “Web Data Scraping with `Rvest`.” February 5, 2024. https://indiraputeri.github.io/posts/2024-03-29-post/.