Unlocking the Web: Data Scraping Techniques

Nithin M

Kerala Agricultural University

Jan 3, 2026

The Problem

  • Many economic datasets are published online but not easily downloadable:

    • 📊 Government economic reports (tables in PDFs/websites)
    • 🏦 Central bank statistics (embedded in HTML pages)
    • 📈 Stock prices, exchange rates (live data on financial sites)
    • 🏠 Real estate data, commodity prices
    • 📰 News archives, economic indicators
  • Your options:

    1. ❌ Manual copy-paste (tedious, error-prone, not reproducible)
    2. ✅ Web scraping (automated, reproducible, scalable)

Web Scraping

Web scraping = Automatically extracting structured data from websites

Website HTML → Parse → Extract → Convert → Analyze

Why economists should care:

  • ✅ Collect data at scale (100s of tables in minutes)
  • ✅ Reproducible methodology (others can verify)
  • ✅ Automation (run on schedule, update data regularly)
  • ✅ Custom datasets (extract exactly what you need)
  • ✅ Research integrity (transparent data sources)

Real-World Economic Examples

Types of data scraped:

  • Tables
  • Charts (images, SVG)
  • Text (headlines, articles)

Sources:

  • World Bank: Poverty rates, GDP, development indicators
  • Central Banks: Interest rates, inflation, exchange rates
  • Stock Markets: Prices, volumes, earnings
  • Academic Databases: Papers, citations, data

The Tools Landscape

Tool             Language    Use Case
datapasta        R           Copy-paste
Beautiful Soup   Python      Static HTML
Scrapy           Python      Large projects
rvest            R           Static HTML
Selenium         Python/R    JavaScript sites

datapasta

datapasta = R package for super easy table copying

How it works:

  1. Select & copy a table from a website (Ctrl+C)
  2. In RStudio: Addins → Paste as data.frame
  3. Instant R code!

Why datapasta Isn’t Enough

Manual process — copy-paste each table individually

Not reproducible — no code documenting how you got the data

Doesn’t scale — impractical for 50+ tables

No automation — can’t schedule weekly updates

Limited scope — only HTML tables, no cards/divs/text

Misses hidden tables — if a page has 5 tables and you copy 1, you miss the other 4

datapasta = great for learning, but NOT for real research!

Beautiful Soup (Python)

What it does:

  • Parses HTML into structured objects
  • Finds elements by tag, ID, class, CSS selectors
  • Extracts text, attributes, tables
  • Cleans up whitespace

rvest (R)

What it does:

  • Reads HTML pages
  • Finds elements using CSS selectors
  • Extracts tables, text, attributes
  • Integrates with tidyverse

Scrapy

Scrapy = advanced Python framework for large-scale projects

When needed:

  • Extract from 100s/1000s of pages
  • Complex data pipelines
  • Automated scheduling
  • Database integration

When NOT needed:

  • Learning web scraping
  • Small projects (1-50 tables)
  • One-time data collection
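
For orientation only, a minimal spider sketch; the spider name and URL are placeholders, and it simply yields the cell text of every table row it finds:

import scrapy

class TableSpider(scrapy.Spider):
    # placeholder name and start URL for illustration
    name = "table_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # response.css() accepts the same CSS selectors used with Beautiful Soup and rvest
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}

Saved as table_spider.py, this could be run with scrapy runspider table_spider.py -o rows.csv.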

HTML Structure: The Basics
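
A web page is a tree of nested tags, and attributes such as id and class are the hooks scrapers use to find elements. A minimal, purely illustrative snippet (all names are made up):

<html>
  <body>
    <h1>Economic Indicators</h1>            <!-- a tag enclosing text -->
    <div id="gdp" class="data-card">        <!-- attributes: unique id, shared class -->
      <table>                               <!-- elements nest inside each other -->
        <tr><th>Year</th><th>GDP growth</th></tr>
        <tr><td>2024</td><td>3.5%</td></tr>
      </table>
    </div>
  </body>
</html>

Tags give structure, id singles out one element, class groups similar elements; scrapers work by walking this tree.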

DevTools: Finding Selectors

In browser (F12):

  1. Right-click table → Inspect Element
  2. See HTML structure
  3. Look for id="something" or class="something"
  4. Copy the selector into your code (see the sketch below)

Most important skill for web scraping!
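
A rough sketch of how a copied selector is used with Beautiful Soup (introduced next); the id gdp is hypothetical and soup is assumed to be an already-parsed page:

# "#gdp" is a hypothetical id copied from DevTools; replace it with whatever
# Inspect Element shows for your page
gdp_div = soup.select_one("#gdp")
if gdp_div is not None:
    print(gdp_div.get_text(strip=True))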

Python - Beautiful Soup

Set-up

  • install the requests, bs4, and lxml packages
  • parse HTML with Beautiful Soup
    • from file, URL, or string
from bs4 import BeautifulSoup as bs
import requests

# From a file
with open("demo_page.html","r") as f:
    soup = bs(f, "lxml")

# From a URL
url = "https://example.com"
response = requests.get(url)
soup = bs(response.content, "lxml")

# From a string
html_text = "<table><tr><td>data</td></tr></table>"
soup = bs(html_text, "lxml")

# View parsed HTML
print(soup.prettify())

find and find_all methods

  • find = first match
  • find_all = all matches
  • can also filter by attributes like id and class
  • nest calls to search within an element
  • regex patterns also work for advanced matching (see the sketch below)
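
A short sketch, assuming soup is the parsed demo page; the id and class values are made up for illustration:

import re

# First match vs. all matches
first_table = soup.find("table")
all_tables = soup.find_all("table")

# Filter by attributes (hypothetical id and class values)
gdp_div = soup.find("div", id="gdp")
cards = soup.find_all("div", class_="data-card")

# Nesting: search inside an element you already found
first_row = first_table.find("tr") if first_table else None

# Regex for advanced matching: every <h2> whose text starts with "Exercise"
exercise_heads = soup.find_all("h2", string=re.compile(r"^Exercise"))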

CSS Selectors

We can also use CSS selectors with the select_one and select methods. For a list of all CSS selectors, see the MDN reference on CSS selectors.

🔴Important🔴 - select always returns a list

head = soup.select("h2")       # all <h2> elements
head_p = soup.select("p~h2")   # <h2> elements that come after a <p> sibling
p_head = soup.select("h2~p")   # <p> elements that come after an <h2> sibling
paras = soup.select("div p")   # <p> elements nested anywhere inside a <div>

getting text and attributes

  • .get_text() or .text to extract text
  • .get() to extract attributes like id, class, href
  • .string - use only when the tag has no nested tags
  • navigation via .parent, .children, and sibling accessors (see the sketch below)
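
A sketch, again assuming soup is a parsed page that contains at least one link:

first_link = soup.find("a")
if first_link is not None:
    print(first_link.get_text(strip=True))  # visible text, extra whitespace trimmed
    print(first_link.get("href"))           # attribute value (None if missing)
    print(first_link.parent.name)           # navigate up to the enclosing tag
    # .string returns text only when the tag has a single text child;
    # with nested tags it is None, so prefer .get_text()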

Challenge

Extract all “Exercises” from the given website.

Solution

import requests
from bs4 import BeautifulSoup as bs

# From local file
with open("resources/demo_page.html", "r", encoding="utf-8") as f:
    html_content = f.read()

# Parse HTML
soup = bs(html_content, "lxml")

# Print
print(soup.prettify())

# Select all <section> elements inside <div>s
sections = soup.select("div section")

# Extract the exercise heading text from each section
exercises = [
    h2.get_text(strip=True)
    for sec in sections
    if (h2 := sec.find("h2")) and h2.get_text(strip=True).startswith("Exercise")
]
print(exercises)

R - rvest

Set-up

  • install rvest package

  • parse HTML with rvest

    • from file, URL, or string
library(rvest) 
# From a file 
soup <- read_html("demo_page.html") 

# From a URL
url <- "https://example.com" 
soup <- read_html(url) 
 
# From a string 
html_text <- "
   <table>
   <tr>
   <td>data</td>
   </tr>
   </table>
   " 
soup <- read_html(html_text)

# View parsed HTML

print(soup)

html_element and html_elements methods

  • html_element = first match
  • html_elements = all matches
  • also use attributes like id, class (prefix with # and .)

CSS Selectors

We can also use CSS selectors with the html_element and html_elements functions. For a list of all CSS selectors, see the MDN reference on CSS selectors.

🔴Important🔴 - html_elements always returns a nodeset

getting text and attributes

  • html_text() and html_text2() to extract text
  • html_attr() to extract attributes like id, class, href
  • html_table() - extract tables as data frames

Challenge

Extract all “Exercises” from the given website.

Solution

library(rvest)
# From local file
soup <- read_html("resources/demo_page.html")
# Print
print(soup)
# Select all section divs
sections <- html_elements(soup, "div section")
# Extract exercises from each section
exercises <- sections |>
  html_elements("h2") |>
  html_text2() |>
  stringr::str_subset("^Exercise")
print(exercises)

Further Exercises

  1. Save the exercises and their descriptions into a CSV file.

e.g.:

Exercise     Description
Exercise 1   Basic Economic Data - GDP
Exercise 2   Multiple Tables Extraction

  2. Extract all tables from the webpage and save them as separate CSV files.

Solution

Python:

import pandas as pd
from io import StringIO

# Extract every table and save each one as its own CSV file
tables = soup.find_all("table")
for i, table in enumerate(tables):
    df = pd.read_html(StringIO(str(table)))[0]
    df.to_csv(f"table_{i+1}.csv", index=False)

R:

# Extract every table and save each one as its own CSV file
tables <- html_elements(soup, "table")
for (i in seq_along(tables)) {
  df <- html_table(tables[[i]], fill = TRUE)
  write.csv(df, paste0("table_", i, ".csv"), row.names = FALSE)
}
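
For the first further exercise, a minimal Python sketch; it reuses soup from the Python solution above and assumes each exercise sits in a <section> with an <h2> title and a <p> description (the real demo_page.html may differ):

import csv

rows = []
for sec in soup.select("div section"):
    h2 = sec.find("h2")
    p = sec.find("p")
    if h2 and h2.get_text(strip=True).startswith("Exercise"):
        rows.append({
            "Exercise": h2.get_text(strip=True),
            "Description": p.get_text(strip=True) if p else "",
        })

with open("exercises.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Exercise", "Description"])
    writer.writeheader()
    writer.writerows(rows)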