Streamlining R Workflows: A Practitioner’s Guide to Data Management

Data Management

Tutorial

Published

Mar 28, 2025

Introduction

If the first line of your R script is
setwd("C:\\Users\\jenny\\path\\that\\only\\I\\have")
I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

If the first line of your R script is
rm(list = ls())
I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

— Jenny Bryan

Data management is a crucial aspect of any data analysis workflow in R. Whether you are a beginner or a seasoned practitioner, effectively renaming variables and managing your datasets can significantly influence your productivity and the clarity of your analysis.

In this guide, we’ll explore common methods that users typically employ to rename variables and create labels in R, followed by a more structured and practical approach that can enhance efficiency and maintainability in your data management practices.

Setting Up Your RStudio Project

Before we dive into data management techniques, it’s essential to organize your work efficiently. I recommend starting by creating a new project in RStudio. This will help you manage your files and set a working directory automatically.

Open RStudio.
Go to File -> New Project....
Choose New Directory -> New Project.
Name your project and select a directory.
Click Create Project.

With your project set up, you can easily manage your files and scripts, ensuring that your working directory is always pointing to the correct folder.

Sample Data

Let us create a small sample data

# Sample data frame
data <- data.frame(
  id = 1:5,
  age = c(21, 25, 30, 22, 28),
  gender = factor(c("Male", "Female", "Male", "Female", "Female")),
  income = c(50000, 60000, 70000, 80000, 90000),
  education = c(1, 2, 3, 2, 1) # Numeric variable for education
)

library(here) # for relative path
# Save the dataset using the here package
write_csv(data, file = here("data", "dummy_data.csv"))

This will store our dummy data in the project folder.

Traditional Approaches to Data Wrangling

In many data projects, tasks like renaming, dropping, labeling variables, and loading packages are treated as one-off steps scattered across multiple scripts. This can make your workflow hard to follow and difficult to reproduce.

Dropping some variables

Dropping variables is a key operation in data cleaning. Let us see how we do it

Base-R
Tidyverse

# Drop 'education' column by excluding it
data <- data[, names(data) != "education"]


# Alternatively, keep only specific columns
data <- data[, c("id", "age", "gender", "income")]

library(dplyr)

# Drop 'education' column using dplyr
data <- data %>% select(-education)

# Keep only specific columns
data <- data %>% select(id, age, gender, income)

Renaming

Renaming variables is essential for clarity and consistency, especially when dealing with raw data that may use cryptic or inconsistent naming conventions, long variable names etc.

Many users tend to rely on base R functions to rename variables within their data frames. The most common method involves using the names() function or colnames() function to directly set new variable names. Here’s how typical users might approach this task:

colnames(data) <- c("respondent_id", "age", "gender", "income", "education_level")
head(data)

a more familiar user of R might prefer using tidyverse approach which can be of two methods

Using the dplyr package, you can rename variables in two ways: with the rename() function or by using the select() function. Below is an example of both approaches

Using rename()
Using select()

# Load dplyr
library(dplyr)

# Sample data frame
data <- data.frame(
  id = 1:5,
  age = c(21, 25, 30, 22, 28),
  gender = factor(c("Male", "Female", "Male", "Female", "Female")),
  income = c(50000, 60000, 70000, 80000, 90000),
  education = c(1, 2, 3, 2, 1) # Numeric variable for education
)

# Renaming variables using rename()
data <- data %>%
  rename(
    respondent_id = id,
    education_level = education
  )

# Display updated names
head(data)

# Load dplyr
library(dplyr)

# Sample data frame
data <- data.frame(
  id = 1:5,
  age = c(21, 25, 30, 22, 28),
  gender = factor(c("Male", "Female", "Male", "Female", "Female")),
  income = c(50000, 60000, 70000, 80000, 90000),
  education = c(1, 2, 3, 2, 1) # Numeric variable for education
)

# Renaming variables using select()
data <- data %>%
  select(
    respondent_id = id,
    age,
    gender,
    income,
    education_level = education
  )

# Display updated names
head(data)

Labelling

Variable labels provide descriptive metadata that help users (including future you!) understand the meaning of each variable. This is especially useful in large projects or when preparing data for sharing. R users often overlook the importance of creating labels for their variables. When they do decide to create labels, they typically use the labelled package or similar approaches. Here’s how they might go about it:

Base R (with Hmisc)
Tidyverse (with labelled)

# install.packages("Hmisc") # if not already installed
library(Hmisc)

label(data$age) <- "Age in years"
label(data$gender) <- "Gender of respondent"
label(data$income) <- "Annual income in USD"
label(data$education) <- "Education level"

# install.packages("labelled") # if not already installed
library(labelled)

data <- data %>%
  set_variable_labels(
    age = "Age in years",
    gender = "Gender of respondent",
    income = "Annual income in USD",
    education = "Education level (1=Primary, 2=Secondary, 3=Tertiary)"
  )

Organizing Workflow: A Practical Take

While the above methods work well and are widely used, they sometimes lack scalability and can lead to messy code, particularly when managing larger datasets or conducting repetitive analyses.

Setting Up the Environment

The first step in creating a reproducible working environment in R is setting up your packages and dependencies in a clean, consistent manner.

Why `pacman`?

Over the years, I’ve tried various strategies to streamline package management — including writing custom functions to install and load required libraries. These functions usually looped through a list of package names, checked if they were installed, installed those that were missing, and then loaded them. While this worked, it was often verbose and harder to maintain over time.

Now, I prefer using pacman, which offers the same functionality with a cleaner, more robust interface.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  tidyverse, haven, janitor, labelled, skimr,
  readxl, writexl, here, stringr
)

This one-liner installs any missing packages and loads them, eliminating the need to manually manage install.packages() and library() calls — and avoids the common practice of commenting out installation lines, which can silently break code for others or even your future self.

My Earlier Approach Before adopting pacman, I wrote a custom function to do just this — install and load a list of packages in a single go. You can view that function here:

Custom Package Loader Function (GitHub Gist)

Beyond pacman: Full Environment Reproducibility

For long-term and collaborative projects, it’s beneficial to complement pacman with tools that enhance reproducibility and robustness — such as renv, targets, and Docker.

These tools help manage dependencies, structure workflows, and containerize your environment for seamless collaboration and future-proofing. I’ll discuss each of them in more detail later.

Rename, Drop, Label — the Smart Way with Data Dictionaries

A data dictionary — basically a tidy rectangular table that lists variable names, their definitions, labels, and other attributes — is one of the most important things you’ll ever create in your research workflow.

Most people think of it as something you make at the end, just to help interpret your dataset. But trust me, it’s way more useful if you start building it before you even collect or clean your data. It acts like a roadmap — helping you stay organized from raw data all the way to your final cleaned dataset.

With this approach:

You won’t lose track of how or why you renamed a variable.
You’ll always know which variables were dropped and what the logic was.
You’ll have consistent, clear labels — no guesswork later on.

It’s not just about reproducibility (though that’s huge). It’s also about clarity — for your collaborators, and more importantly, your future self. This way, your data workflow becomes transparent, structured, and way easier to manage as your project grows.

Integrating Data Dictionaries into Your Workflow

Data dictionaries are ideally created at the beginning of a project — they act as a roadmap for variable naming, labeling, and documentation. While ideal is not always the norm, it’s never too late to start building one. Even midway through a project, a thoughtfully crafted data dictionary can bring structure and clarity to your data workflow.

Traditionally, data dictionaries or codebooks are treated as standalone documentation. But integrating them directly into your data cleaning and transformation pipeline has been a game changer for me. It ensures consistency across renaming, labeling, and dropping operations — all while making your code more readable and reproducible.

Creating a new data dictionary

you can create a new data dictionary by simply creating a new excel fileby incluing following as column names if you are begining a project.

Fields to Include	Optional Helpful Fields
Variable name	Skip patterns
Variable label (What is this item?)	Required item (Were participants allowed to skip this item?)
Variable type	Variable universe (Who got this item?)
Allowable values/range	Notes (such as versions/changes to this variable)
Assigned missing values	Associated scale/subscale
Recoding/calculations	Time periods this item is available (if study is longitudinal)

Credits: Crystal Lewis (Data Management in Educational Research)

I started integrating a data dictionary midway through my project, which meant I had to take a different path. My dataset had over 100 variables, making manual documentation impractical.

Instead of creating everything from scratch, I began building the dictionary programmatically alongside my cleaning workflow. This approach allowed me to scale efficiently, especially with datasets containing a large number of variables.

The code below demonstrates how to create a template data dictionary from an existing dataset:

# Create the data dictionary tibble
data_dictionary <- tibble(
  "NEW_NAME" = NA,
  OLD_NAME = colnames(data),
  LABEL = NA,
  VALUE = values,
  OLD_TYPE = apply(data, 2, typeof)
)

# Save the updated data dictionary
write_xlsx(data_dictionary,here("data_dictionary.xlsx"))

Once the data dictionary is created, we can begin integrating it into our cleaning workflow.

Dropping a variable

Let’s consider the case of dropping variables. Suppose we want to drop the variable age. In the data_dictionary, we simply set the new_name column for age to "drop". Then, we use the following code to filter those variables:

dict <- read_excel(here("data_dictionary.xlsx")) # read the data dictionary

drop_vars <- dict %>%
    filter(new_name == "drop") %>%
    pull(old_name)

data  <- data  |> 
select(-any_of(drop_vars))

While manually tagging new_name as “drop” in the data dictionary might seem tedious at first, it actually helps reduce typos and eliminates the need to repeat variable names throughout your script. Once it’s there in the dictionary, you don’t need to touch it again.

Renaming a variable

Renaming variables can quickly turn into a nightmare when working with a large number of them. Manually typing both old and new variable names inside the rename() function is not only tedious — it’s also error-prone.

This is where the data_dictionary saves the day. Instead of renaming variables directly in your script, you simply enter the new variable names in the new_name column of your dictionary. Then, use a similar approach as we did with dropping variables.

 rename_vars <- data_dictionary %>%
    select(new_name, old_name) %>%
    filter(!new_name == "drop") %>%
    deframe()

data <- data |> 
rename(any_of(rename_vars))

Labelling variables

Adding metadata in the form of variable labels was not even in my workflow untill recently and I belive it is common with most R users to not include lables — but trust me, it can make your life a lot easier. With labels in place, you don’t even have to open your data dictionary to get what a column means.it is quite helpful, especially when you’re revisiting a dataset after a while or sharing it with collaborators.

In her blog post on labelled data, Shannon Pileggi talks about how labelled data makes it easier to use metadata in tables and figures, which is great if you’re building data products.

There are multiple ways to do this in R, but my personal favorite is the codebook package along with labelled package.

Below is how I label variables using data dictionay and the codebook package.

var_labels <- dict |> 
select(new_name,label) %>% 
  filter(new_name!="drop") %>% 
  dict_to_list()

var_label(data) <- var_labels

Is that all? No!! We can go further — we can create value labels, extend this workflow to auto-generate a codebook, and even integrate it with your reporting pipeline.

I’ll dive into all that in some later posts.

For now — that’s it. Hope this gave you a helpful starting point for making your data cleaning process more structured and reproducible.

Till next time — bye!! 👋

References

Crystal Lewis. Data Management in Educational Research. https://datamgmtinedresearch.com/
Crystal Lewis. Creating a data dictionary (Tutorial).
Shannon Pileggi. Working with Labelled Data.
Jenny Bryan. Naming Things — on the importance of naming and organizing code and data files.
Jenny Bryan et al. Happy Git and GitHub for the useR. https://happygitwithr.com/
labelled package documentation

Disclaimer

This post was drafted with the assistance of AI copy-editing tools to enhance clarity and flow. Any remaining errors are entirely my own.