Unlocking the Web: Data Scraping Techniques

Nithin M

Kerala Agricultural University

Jan 3, 2026

The Problem

  • Many economic datasets are published online but not easily downloadable:

    • 📊 Government economic reports (tables in PDFs/websites)
    • 🏦 Central bank statistics (embedded in HTML pages)
    • 📈 Stock prices, exchange rates (live data on financial sites)
    • 🏠 Real estate data, commodity prices
    • 📰 News archives, economic indicators
  • Your options:

    1. ❌ Manual copy-paste (tedious, error-prone, not reproducible)
    2. ✅ Web scraping (automated, reproducible, scalable)

Web Scraping

Web scraping = Automatically extracting structured data from websites

Website HTML → Parse → Extract → Convert → Analyze

Why economists should care:

  • ✅ Collect data at scale (100s of tables in minutes)
  • ✅ Reproducible methodology (others can verify)
  • ✅ Automation (run on schedule, update data regularly)
  • ✅ Custom datasets (extract exactly what you need)
  • ✅ Research integrity (transparent data sources)

Real-World Economic Examples

Types of data scraped:

  • Tables
  • Charts (images, SVG)
  • Text (headlines, articles)

Sources:

  • World Bank: Poverty rates, GDP, development indicators
  • Central Banks: Interest rates, inflation, exchange rates
  • Stock Markets: Prices, volumes, earnings
  • Academic Databases: Papers, citations, data

The Tools Landscape

Tool             Language    Use Case
datapasta        R           Copy-paste
Beautiful Soup   Python      Static HTML
Scrapy           Python      Large projects
rvest            R           Static HTML
Selenium         Python/R    JavaScript sites

datapasta

datapasta = R package for super easy table copying

How it works:

  1. Select & copy a table from a website (Ctrl+C)
  2. In RStudio: Addins → Paste as data.frame
  3. Instant R code!

Why datapasta Isn’t Enough

Manual process — copy-paste each table individually

Not reproducible — no code documenting how you got the data

Doesn’t scale — impractical for 50+ tables

No automation — can’t schedule weekly updates

Limited scope — only HTML tables, no cards/divs/text

Misses hidden tables — if a page has 5 tables and you copy 1, you miss the other 4

datapasta = great for learning, but NOT for real research!

Beautiful Soup (Python)

What it does:

  • Parses HTML into structured objects
  • Finds elements by tag, ID, class, CSS selectors
  • Extracts text, attributes, tables
  • Cleans up whitespace

rvest (R)

What it does:

  • Reads HTML pages
  • Finds elements using CSS selectors
  • Extracts tables, text, attributes
  • Integrates with tidyverse

Scrapy

Scrapy = advanced Python framework for large-scale projects

When needed:

  • Extract from 100s/1000s of pages
  • Complex data pipelines
  • Automated scheduling
  • Database integration

When NOT needed:

  • Learning web scraping
  • Small projects (1-50 tables)
  • One-time data collection
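
For orientation only, a minimal spider sketch; the spider name and URL are placeholders, and it simply yields the cell text of every table row it finds:

import scrapy

class TableSpider(scrapy.Spider):
    # placeholder name and start URL for illustration
    name = "table_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # response.css() accepts the same CSS selectors used with Beautiful Soup and rvest
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}

Saved as table_spider.py, this could be run with scrapy runspider table_spider.py -o rows.csv.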

HTML Structure: The Basics
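
A web page is a tree of nested tags, and attributes such as id and class are the hooks scrapers use to find elements. A minimal, purely illustrative snippet (all names are made up):

<html>
  <body>
    <h1>Economic Indicators</h1>            <!-- a tag enclosing text -->
    <div id="gdp" class="data-card">        <!-- attributes: unique id, shared class -->
      <table>                               <!-- elements nest inside each other -->
        <tr><th>Year</th><th>GDP growth</th></tr>
        <tr><td>2024</td><td>3.5%</td></tr>
      </table>
    </div>
  </body>
</html>

Tags give structure, id singles out one element, class groups similar elements; scrapers work by walking this tree.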

DevTools: Finding Selectors

In browser (F12):

  1. Right-click table → Inspect Element
  2. See HTML structure
  3. Look for id="something" or class="something"
  4. Copy the selector into your code (see the sketch below)

Most important skill for web scraping!
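
A rough sketch of how a copied selector is used with Beautiful Soup (introduced next); the id gdp is hypothetical and soup is assumed to be an already-parsed page:

# "#gdp" is a hypothetical id copied from DevTools; replace it with whatever
# Inspect Element shows for your page
gdp_div = soup.select_one("#gdp")
if gdp_div is not None:
    print(gdp_div.get_text(strip=True))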

Python - Beautiful Soup

Set-up

  • install the requests, bs4, and lxml packages
  • parse HTML with Beautiful Soup
    • from file, URL, or string
from bs4 import BeautifulSoup as bs
import requests

# From a file
with open("demo_page.html","r") as f:
    soup = bs(f, "lxml")

# From a URL
url = "https://example.com"
response = requests.get(url)
soup = bs(response.content, "lxml")

# From a string
html_text = "<table><tr><td>data</td></tr></table>"
soup = bs(html_text, "lxml")

# View parsed HTML
print(soup.prettify())

find and find_all methods

  • find = first match
  • find_all = all matches
  • can also filter by attributes like id and class
  • nest calls to search within an element
  • regex patterns also work for advanced matching (see the sketch below)
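
A short sketch, assuming soup is the parsed demo page; the id and class values are made up for illustration:

import re

# First match vs. all matches
first_table = soup.find("table")
all_tables = soup.find_all("table")

# Filter by attributes (hypothetical id and class values)
gdp_div = soup.find("div", id="gdp")
cards = soup.find_all("div", class_="data-card")

# Nesting: search inside an element you already found
first_row = first_table.find("tr") if first_table else None

# Regex for advanced matching: every <h2> whose text starts with "Exercise"
exercise_heads = soup.find_all("h2", string=re.compile(r"^Exercise"))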

CSS Selectors

We can also use CSS selectors with the select_one and select methods. For a list of all CSS selectors, see the MDN reference on CSS selectors.

🔴Important🔴 - select always returns a list

head = soup.select("h2")       # all <h2> elements
head_p = soup.select("p~h2")   # <h2> elements that come after a <p> sibling
p_head = soup.select("h2~p")   # <p> elements that come after an <h2> sibling
paras = soup.select("div p")   # <p> elements nested anywhere inside a <div>

getting text and attributes

  • .get_text() or .text to extract text
  • .get() to extract attributes like id, class, href
  • .string - use only when the tag has no nested tags
  • navigation via .parent, .children, and sibling accessors (see the sketch below)
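
A sketch, again assuming soup is a parsed page that contains at least one link:

first_link = soup.find("a")
if first_link is not None:
    print(first_link.get_text(strip=True))  # visible text, extra whitespace trimmed
    print(first_link.get("href"))           # attribute value (None if missing)
    print(first_link.parent.name)           # navigate up to the enclosing tag
    # .string returns text only when the tag has a single text child;
    # with nested tags it is None, so prefer .get_text()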

Challenge

Extract all “Exercises” from the given website.

Solution

import requests
from bs4 import BeautifulSoup as bs

# From local file
with open("resources/demo_page.html", "r", encoding="utf-8") as f:
    html_content = f.read()

# Parse HTML
soup = bs(html_content, "lxml")

# Print
print(soup.prettify())

# Select all <section> elements inside <div>s
sections = soup.select("div section")

# Extract the exercise heading text from each section
exercises = [
    h2.get_text(strip=True)
    for sec in sections
    if (h2 := sec.find("h2")) and h2.get_text(strip=True).startswith("Exercise")
]
print(exercises)

R - rvest

Set-up

  • install rvest package

  • parse HTML with rvest

    • from file, URL, or string
library(rvest) 
# From a file 
soup <- read_html("demo_page.html") 

# From a URL
url <- "https://example.com" 
soup <- read_html(url) 
 
# From a string 
html_text <- "
   <table>
   <tr>
   <td>data</td>
   </tr>
   </table>
   " 
soup <- read_html(html_text)

# View parsed HTML

print(soup)

html_element and html_elements methods

  • html_element = first match
  • html_elements = all matches
  • also use attributes like id, class (prefix with # and .)

CSS Selectors

We can also use CSS selectors with the html_element and html_elements functions. For a list of all CSS selectors, see the MDN reference on CSS selectors.

🔴Important🔴 - html_elements always returns a nodeset

getting text and attributes

  • html_text() and html_text2() to extract text
  • html_attr() to extract attributes like id, class, href
  • html_table() - extract tables as data frames

Challenge

Extract all “Exercises” from the given website.

Solution

library(rvest)
# From local file
soup <- read_html("resources/demo_page.html")
# Print
print(soup)
# Select all section divs
sections <- html_elements(soup, "div section")
# Extract exercises from each section
exercises <- sections |>
  html_elements("h2") |>
  html_text2() |>
  stringr::str_subset("^Exercise")
print(exercises)

Further Exercises

  1. Save the exercises and their descriptions into a CSV file.

e.g.:

Exercise     Description
Exercise 1   Basic Economic Data - GDP
Exercise 2   Multiple Tables Extraction

  2. Extract all tables from the webpage and save them as separate CSV files.

Solution

Python:

import pandas as pd
from io import StringIO

# Extract every table and save each one as its own CSV file
tables = soup.find_all("table")
for i, table in enumerate(tables):
    df = pd.read_html(StringIO(str(table)))[0]
    df.to_csv(f"table_{i+1}.csv", index=False)

R:

# Extract every table and save each one as its own CSV file
tables <- html_elements(soup, "table")
for (i in seq_along(tables)) {
  df <- html_table(tables[[i]], fill = TRUE)
  write.csv(df, paste0("table_", i, ".csv"), row.names = FALSE)
}
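
For the first further exercise, a minimal Python sketch; it reuses soup from the Python solution above and assumes each exercise sits in a <section> with an <h2> title and a <p> description (the real demo_page.html may differ):

import csv

rows = []
for sec in soup.select("div section"):
    h2 = sec.find("h2")
    p = sec.find("p")
    if h2 and h2.get_text(strip=True).startswith("Exercise"):
        rows.append({
            "Exercise": h2.get_text(strip=True),
            "Description": p.get_text(strip=True) if p else "",
        })

with open("exercises.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Exercise", "Description"])
    writer.writeheader()
    writer.writerows(rows)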