Most data arrives messy, confusing, and stubbornly resistant to analysis. The tidyverse isn’t just a collection of R packages; it’s a philosophy that turns data wrangling from a frustrating puzzle into an intuitive conversation. Think of it as learning to speak data’s native language, where every operation feels natural and every transformation makes sense.
## First Contact: Getting Your Data Talking
Before we dive into complex operations, let’s set the stage. The tidyverse works best when your data follows three simple rules:
- Each variable lives in its own column
- Each observation lives in its own row
- Each type of observational unit lives in its own table
When your data follows this “tidy” structure, everything becomes easier.
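A quick illustration with made-up numbers: the same revenue data stored untidy, with the month variable smeared across column names, versus tidy, with one observation per row:

```r
library(tidyverse)

# Untidy: "month" is hiding in the column names
untidy <- tibble(
  user_id = c("U001", "U002"),
  jan = c(29.99, 9.99),
  feb = c(29.99, 9.99)
)

# Tidy: each variable (user, month, revenue) gets its own column,
# each observation gets its own row
tidy <- tibble(
  user_id = c("U001", "U001", "U002", "U002"),
  month   = c("jan", "feb", "jan", "feb"),
  revenue = c(29.99, 29.99, 9.99, 9.99)
)
```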
```r
library(tidyverse)

# Meet our dataset: monthly subscription metrics
subscriptions <- tibble(
  user_id = c("U001", "U002", "U003", "U004", "U005"),
  plan_type = c("premium", "basic", "premium", "enterprise", "basic"),
  signup_date = as.Date(c("2024-01-15", "2024-02-01", "2024-01-20", "2024-03-10", "2024-02-28")),
  monthly_revenue = c(29.99, 9.99, 29.99, 99.99, 9.99),
  months_active = c(8, 7, 8, 6, 7),
  support_tickets = c(2, 5, 1, 3, 8)
)

print("Our starting point:")
print(subscriptions)
```
## The Essential Verbs: What Would You Naturally Ask Your Data?
Data manipulation boils down to answering questions. The tidyverse gives you intuitive verbs for each type of question.
### “Show me only the premium users” → `filter()`
```r
premium_users <- subscriptions %>%
  filter(plan_type == "premium")

print("Our premium subscribers:")
print(premium_users)
```
### “Just show me their IDs and revenue” → `select()`
```r
user_financials <- subscriptions %>%
  select(user_id, plan_type, monthly_revenue)

print("Financial snapshot:")
print(user_financials)
```
### “What’s each customer’s total lifetime value?” → `mutate()`
```r
with_lifetime_value <- subscriptions %>%
  mutate(
    lifetime_value = monthly_revenue * months_active,
    value_per_ticket = lifetime_value / support_tickets  # Lifetime value per support interaction
  )

print("Customers with lifetime value:")
print(with_lifetime_value)
```
### “Who are our most valuable customers?” → `arrange()`
```r
by_value <- with_lifetime_value %>%
  arrange(desc(lifetime_value))

print("Customers ranked by value:")
print(by_value)
```
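These verbs also compose: instead of four separate steps, you can chain them into one pipeline. A sketch, with the toy data restated so the snippet runs on its own:

```r
library(tidyverse)

# Re-stating a slice of the toy data from above for a self-contained run
subscriptions <- tibble(
  user_id = c("U001", "U002", "U003", "U004", "U005"),
  plan_type = c("premium", "basic", "premium", "enterprise", "basic"),
  monthly_revenue = c(29.99, 9.99, 29.99, 99.99, 9.99),
  months_active = c(8, 7, 8, 6, 7)
)

# One pipeline: keep premium users, derive a value, rank it, trim the columns
top_premium <- subscriptions %>%
  filter(plan_type == "premium") %>%
  mutate(lifetime_value = monthly_revenue * months_active) %>%
  arrange(desc(lifetime_value)) %>%
  select(user_id, lifetime_value)

print(top_premium)
```

Each step hands a tibble to the next, so the pipeline reads top to bottom like the question it answers.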
## Seeing Patterns: Grouped Thinking
The real magic happens when you stop looking at individual rows and start seeing groups and patterns.
### “How does each plan type perform?” → `group_by()` + `summarise()`
```r
plan_performance <- subscriptions %>%
  group_by(plan_type) %>%
  summarise(
    total_customers = n(),
    avg_lifetime_value = mean(monthly_revenue * months_active),
    avg_support_tickets = mean(support_tickets),
    retention_months = mean(months_active)
  ) %>%
  ungroup()  # summarise() already drops the last grouping level; ungroup() makes that explicit

print("Plan performance breakdown:")
print(plan_performance)
```
### “Which customers are costing us too much in support?”
```r
support_analysis <- subscriptions %>%
  mutate(
    lifetime_value = monthly_revenue * months_active,
    support_cost_estimate = support_tickets * 25,  # Assume $25 per ticket
    profitability_score = (lifetime_value - support_cost_estimate) / lifetime_value
  ) %>%
  filter(profitability_score < 0.7) %>%  # Highlight problematic accounts
  arrange(profitability_score)

print("Customers with support cost concerns:")
print(support_analysis)
```
## Joining Forces: Connecting Related Data
Real-world data usually lives in multiple tables. Joins help you reconnect these separated pieces.
```r
# Additional user demographic data
user_demographics <- tibble(
  user_id = c("U001", "U002", "U003", "U004", "U005"),
  company_size = c("individual", "small", "individual", "enterprise", "small"),
  industry = c("tech", "education", "healthcare", "finance", "retail")
)

# Customer feedback scores
feedback_scores <- tibble(
  user_id = c("U001", "U003", "U005"),
  satisfaction_score = c(9, 8, 6)
)

# Bring it all together
complete_view <- subscriptions %>%
  left_join(user_demographics, by = "user_id") %>%
  left_join(feedback_scores, by = "user_id")

print("Complete customer picture:")
print(complete_view)
```
## Reshaping Data: The Art of Perspective
Sometimes you need to change how your data is organized to answer different questions.
### From Wide to Long: Making Data Tidy
```r
# Suppose we have quarterly revenue in wide format
quarterly_revenue_wide <- tibble(
  user_id = c("U001", "U002", "U003"),
  Q1_2024 = c(29.99, 9.99, 29.99),
  Q2_2024 = c(29.99, 9.99, 29.99),
  Q3_2024 = c(29.99, 9.99, 29.99)
)

print("Wide format (hard to analyze):")
print(quarterly_revenue_wide)

# Transform to long format
quarterly_revenue_long <- quarterly_revenue_wide %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "quarter",
    values_to = "revenue"
  )

print("Long format (easy to analyze):")
print(quarterly_revenue_long)
```
### From Long to Wide: Creating Summary Views
```r
# Sometimes you need wide format for reporting
plan_summary_wide <- subscriptions %>%
  group_by(plan_type) %>%
  summarise(
    avg_months = mean(months_active),
    avg_revenue = mean(monthly_revenue)
  ) %>%
  pivot_wider(
    names_from = plan_type,
    values_from = c(avg_months, avg_revenue)
  )

print("Wide format summary for reporting:")
print(plan_summary_wide)
```
## Handling Real-World Messiness
Clean data is the exception, not the rule. Here’s how to handle common data quality issues.
```r
# Sample data with real-world problems
messy_subscriptions <- tibble(
  user_id = c("U006", "U007", "U008", "U009", NA),
  plan_type = c("PREMIUM", "basic", "premium ", "enterprise", "basic"),
  signup_date = c("2024-01-15", "invalid_date", "2024-01-20", "2024-03-10", "2024-02-28"),
  monthly_revenue = c(29.99, 9.99, 29.99, NA, 9.99)
)

print("The messy reality:")
print(messy_subscriptions)

# Cleaning pipeline
clean_data <- messy_subscriptions %>%
  # Handle missing values
  drop_na(user_id) %>%
  # Standardize text
  mutate(
    plan_type = str_trim(str_to_lower(plan_type)),
    plan_type = case_when(
      plan_type == "premium" ~ "premium",
      plan_type == "basic" ~ "basic",
      plan_type == "enterprise" ~ "enterprise",
      TRUE ~ "other"
    )
  ) %>%
  # Parse dates safely
  mutate(
    signup_date = as.Date(signup_date, format = "%Y-%m-%d"),
    signup_date = if_else(is.na(signup_date), as.Date("2024-01-01"), signup_date)
  ) %>%
  # Handle missing revenue
  mutate(
    monthly_revenue = replace_na(monthly_revenue, 0)
  )

print("After cleaning:")
print(clean_data)
```
## Working with Dates and Times
Temporal data requires special handling, and lubridate makes it intuitive.
```r
library(lubridate)

time_analysis <- subscriptions %>%
  mutate(
    signup_year = year(signup_date),
    signup_month = month(signup_date, label = TRUE),
    signup_quarter = quarter(signup_date),
    days_since_signup = as.integer(Sys.Date() - signup_date),
    # Business-specific time logic
    cohort = floor_date(signup_date, "month"),
    is_long_term = months_active > 6
  )

print("Temporal analysis:")
print(time_analysis %>% select(user_id, signup_date, cohort, is_long_term))
```
## Advanced Patterns: Thinking in Pipelines
As you get comfortable, you’ll start building sophisticated analysis pipelines.
```r
# Comprehensive customer health scoring
customer_health <- subscriptions %>%
  left_join(feedback_scores, by = "user_id") %>%
  mutate(
    # Calculate metrics
    lifetime_value = monthly_revenue * months_active,
    ticket_ratio = support_tickets / months_active,
    # Score components (0-10 scale)
    value_score = scales::rescale(lifetime_value, to = c(0, 10)),
    retention_score = scales::rescale(months_active, to = c(0, 10)),
    support_score = 10 - scales::rescale(ticket_ratio, to = c(0, 10)),
    satisfaction_score = replace_na(satisfaction_score, 5),
    # Overall health score
    health_score = 0.3 * value_score + 0.3 * retention_score +
      0.2 * support_score + 0.2 * satisfaction_score,
    # Classification
    health_tier = case_when(
      health_score >= 8 ~ "Excellent",
      health_score >= 6 ~ "Good",
      health_score >= 4 ~ "Needs Attention",
      TRUE ~ "At Risk"
    )
  ) %>%
  arrange(desc(health_score))

print("Customer health assessment:")
print(customer_health %>% select(user_id, plan_type, health_score, health_tier))
```
## Scaling Up: The Same Logic, Bigger Data
The beautiful part? These same patterns work whether you’re analyzing 100 rows or 100 million rows.
```r
# For database-backed data
library(DBI)
library(dbplyr)

# Connect to a remote database
# con <- dbConnect(RPostgres::Postgres(), …)

# The code looks exactly the same!
# big_analysis <- tbl(con, "subscriptions") %>%
#   filter(plan_type == "premium") %>%
#   group_by(signup_year = year(signup_date)) %>%
#   summarise(total_revenue = sum(monthly_revenue)) %>%
#   collect()
```
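To make that concrete, here's a runnable sketch against an in-memory SQLite database standing in for the remote server, assuming the RSQLite package is installed; dbplyr translates the same dplyr verbs into SQL behind the scenes:

```r
library(tidyverse)
library(DBI)
library(dbplyr)

# In-memory SQLite stands in for a real remote database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "subscriptions", tibble(
  user_id = c("U001", "U002", "U003"),
  plan_type = c("premium", "basic", "premium"),
  monthly_revenue = c(29.99, 9.99, 29.99)
))

# Identical dplyr verbs; nothing is computed in R until collect()
result <- tbl(con, "subscriptions") %>%
  filter(plan_type == "premium") %>%
  summarise(total_revenue = sum(monthly_revenue, na.rm = TRUE)) %>%
  collect()

print(result)
DBI::dbDisconnect(con)
```

Note the sketch sticks to verbs SQLite can express; functions like `year()` need a backend with richer date support, such as Postgres. Piping the query into `show_query()` instead of `collect()` reveals the generated SQL.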
## Conclusion: From Data Mechanic to Data Storyteller
Mastering the tidyverse transforms your relationship with data. You stop being a mechanic who struggles with rusty tools and become a storyteller who can make data reveal its secrets.
The key mindset shifts:
- Think in verbs, not functions – What do you want to do with your data?
- Build pipelines, not scripts – Each step naturally flows to the next
- Embrace consistency – The same patterns work across different problems
- Focus on questions – Let your analytical questions drive the code, not the other way around
The examples we’ve covered—filtering, selecting, mutating, grouping, joining, reshaping—are your basic vocabulary. With practice, you’ll start combining them into elegant sentences and paragraphs that tell compelling data stories.
Remember, clean data wrangling isn’t about writing clever code—it’s about writing clear code. Code that your future self will understand, that your colleagues can follow, and that reliably produces accurate results.
Now you’re not just manipulating data; you’re having a conversation with it. And like any good conversation, the more you practice, the more natural it becomes. So go ahead—ask your data some interesting questions and see what stories it has to tell.