Kerala Agricultural University
Jan 3, 2026
Many economic datasets are published online but are not easily downloadable.
Your options range from manual copy-paste to fully automated scraping.
Web scraping = Automatically extracting structured data from websites
Website HTML → Parse → Extract → Convert → Analyze
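A minimal sketch of this pipeline in Python; the URL, the presence of a table, and the column layout are assumptions for illustration, not part of the original example:

```python
# Sketch of the pipeline: fetch -> parse -> extract -> convert -> analyze.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "https://example.com/gdp-table"       # hypothetical URL with one HTML table
response = requests.get(url)                # 1. fetch the website's HTML
soup = bs(response.content, "lxml")         # 2. parse it

rows = []
for tr in soup.select("table tr"):          # 3. extract every table row
    cells = [td.get_text(strip=True) for td in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

if rows:                                    # 4. convert to a data frame
    df = pd.DataFrame(rows[1:], columns=rows[0])
    print(df.head())                        # 5. analyze
```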
| Tool | Language | Use Case |
|---|---|---|
| datapasta | R | Copy-paste |
| Beautiful Soup | Python | Static HTML |
| Scrapy | Python | Large projects |
| rvest | R | Static HTML |
| Selenium | Python/R | JavaScript sites |
datapasta = R package for super easy table copying
❌ Manual process — copy-paste each table individually
❌ Not reproducible — no code documenting how you got the data
❌ Doesn’t scale — impractical for 50+ tables
❌ No automation — can’t schedule weekly updates
❌ Limited scope — only HTML tables, no cards/divs/text
❌ Misses hidden tables — if a page has 5 tables and you copy only 1, you miss the other 4
datapasta = great for learning, but NOT for real research!
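For contrast, a hedged sketch of what a scripted, reproducible alternative can look like; the URL is a placeholder, and pandas.read_html needs lxml installed:

```python
# Reproducible, scripted alternative (sketch): pull *every* table on a page
# in one step. The URL is a placeholder; pandas.read_html requires lxml.
import pandas as pd

url = "https://example.com/economic-indicators"   # hypothetical URL
tables = pd.read_html(url)        # one DataFrame per <table> on the page
print(f"Found {len(tables)} tables")              # nothing is silently missed
tables[0].to_csv("table_1.csv", index=False)      # save for later analysis
```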
What Beautiful Soup does (a small sketch follows this list):

- Parses HTML into structured objects
- Finds elements by tag, ID, class, or CSS selector
- Extracts text, attributes, and tables
- Cleans up whitespace
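A small illustration of these capabilities on a made-up HTML snippet (the tag names and classes are invented for the example):

```python
# Made-up HTML snippet: Beautiful Soup can also read content that never
# appears in a <table>, such as cards/divs, and clean up the whitespace.
from bs4 import BeautifulSoup as bs

html = """
<div class="indicator-card">
  <h3>   GDP growth   </h3>
  <span class="value">6.8%</span>
  <a href="/methodology">details</a>
</div>
"""
soup = bs(html, "lxml")
card = soup.find("div", class_="indicator-card")
print(card.h3.get_text(strip=True))              # "GDP growth"
print(card.find("span", class_="value").text)    # "6.8%"
print(card.a.get("href"))                        # "/methodology"
```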
What rvest does:

- Reads HTML pages
- Finds elements using CSS selectors
- Extracts tables, text, and attributes
- Integrates with the tidyverse
Scrapy = advanced Python framework for large-scale projects
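A hedged sketch of what a minimal Scrapy spider looks like; the spider name, start URL, and selectors are placeholders, not taken from the slides:

```python
# Minimal Scrapy spider sketch; the name, start URL, and selectors are
# placeholders. Run with: scrapy runspider spider.py -o rows.json
import scrapy

class TableSpider(scrapy.Spider):
    name = "econ_tables"
    start_urls = ["https://example.com/data"]   # hypothetical URL

    def parse(self, response):
        # yield one item per table row found on the page
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}
```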
id="something" or class="something"Most important skill for web scraping!
Use the requests and bs4 packages:
from bs4 import BeautifulSoup as bs
import requests
# From a file
with open("demo_page.html","r") as f:
soup = bs(f, "lxml")
# From a URL
url = "https://example.com"
response = requests.get(url)
soup = bs(response.content, "lxml")
# From a string
html_text = "<table><tr><td>data</td></tr></table>"
soup = bs(html_text, "lxml")
# View parsed HTML
print(soup.prettify())

find and find_all methods:
- find = first match
- find_all = all matches
- both can filter by id and class
- we can also use CSS selectors with the select_one and select methods (see a CSS selector reference for the full list)
🔴Important🔴 - select always returns a list
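A quick sketch contrasting the four lookups on a made-up snippet (note that select returns a list even when there is only one match):

```python
# Contrast of find / find_all / select_one / select on a made-up snippet.
from bs4 import BeautifulSoup as bs

html = """
<div id="data">
  <a class="link" href="/gdp">GDP</a>
  <a class="link" href="/cpi">CPI</a>
</div>
"""
soup = bs(html, "lxml")

print(soup.find("a"))                  # first matching tag
print(len(soup.find_all("a")))         # all matching tags -> 2
print(soup.select_one("div#data a"))   # first match via a CSS selector
print(soup.select("a.link"))           # select ALWAYS returns a list
```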
- .get_text() or .text to extract text
- .get() to extract attributes like id, class, href
- .string - but use only when we do not have nesting

Exercise: extract all the "Exercises" from the given website.
import requests
from bs4 import BeautifulSoup as bs
# From local file
with open("resources/demo_page.html", "r", encoding="utf-8") as f:
html_content = f.read()
# Parse HTML
soup = bs(html_content, "lxml")
# Print
print(soup.prettify())
# Select all section divs
sections = soup.select("div section")
# Extract exercises from each section
exercises = [h2 for i in sections if (h2 := i.find("h2")) and h2.get_text(strip=True).startswith("Exercise")]

Install the rvest package
parse HTML with rvest
html_element and html_elements methods:
- html_element = first match
- html_elements = all matches
- id, class (prefix with # and .)
- we can also use CSS selectors with the html_element and html_elements methods (see a CSS selector reference for the full list)

🔴Important🔴 - html_elements always returns a nodeset
- html_text() and html_text2() to extract text
- html_attr() to extract attributes like id, class, href
- html_table() to extract tables

library(rvest)
# From local file
soup <- read_html("resources/demo_page.html")
# Print
print(soup)
# Select all section divs
sections <- html_elements(soup, "div section")
# Extract exercises from each section
exercises <- sections |>
  html_elements("h2") |>
  html_text2() |>
  stringr::str_subset("^Exercise")
print(exercises)

e.g.:
| Exercise | Description |
|---|---|
| Exercise 1 | Basic Economic Data - GDP |
| Exercise 2 | Multiple Tables Extraction |
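For instance, a hedged sketch of the "Multiple Tables Extraction" idea in Python, reusing the local demo page from above (the actual exercise tables may differ):

```python
# Sketch of a "Multiple Tables Extraction" solution: loop over every <table>
# in the demo page with Beautiful Soup (the page structure is assumed).
from bs4 import BeautifulSoup as bs

with open("resources/demo_page.html", "r", encoding="utf-8") as f:
    soup = bs(f.read(), "lxml")

for i, table in enumerate(soup.find_all("table"), start=1):
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    print(f"Table {i}: {len(rows)} rows")
```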
https://github.com/nithinmkp/flame-economiga