# My Personal R Package

##### Jun 02, 2022

The semester is over, so what better time to get to work on my personal R package, ladtools. Most of the functions are simple wrappers that improve my workflow.

Find the package on my GitHub here.

### geom_lm() and scatter()

I fit a lot of linear regressions and I do not enjoy the base::plot() and base::abline() syntax for quick visualization. So instead, I built two functions, geom_lm() and scatter().

geom_lm() is a wrapper for geom_smooth() with nicer defaults. Instead of the default smoother (LOESS for data sets under 1,000 observations), it fits a simple OLS line, does not draw the standard-error band, and does not print that pesky message when the formula is not declared.

ggplot(midwest) +
  aes(x = percollege, y = percbelowpoverty) +
  geom_point() +
  geom_smooth() +
  theme_blog()
## geom_smooth() using method = 'loess' and formula 'y ~ x'

That message annoys me to no end.

ggplot(midwest) +
  aes(x = percollege, y = percbelowpoverty) +
  geom_point() +
  geom_lm() +
  theme_blog()
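A wrapper like this takes only a few lines. The sketch below is a minimal, hypothetical version, not the actual ladtools source: it assumes geom_lm() simply forwards its arguments to geom_smooth() with method = "lm", se = FALSE, and an explicit formula, which is what silences the message.

```r
library(ggplot2)

# Hypothetical re-implementation; the real ladtools::geom_lm() may differ.
geom_lm <- function(..., formula = y ~ x, se = FALSE) {
  # Declaring both method and formula explicitly keeps geom_smooth() quiet
  geom_smooth(..., method = "lm", formula = formula, se = se)
}

p <- ggplot(midwest) +
  aes(x = percollege, y = percbelowpoverty) +
  geom_point() +
  geom_lm()
```

Because se and formula are ordinary arguments in this sketch, geom_lm(se = TRUE) would bring the band back for a single plot.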

scatter() replicates the most frequent use case of Stata's scatter command: quickly plotting data with a linear trend line through it. scatter() is just ggplot(), geom_point(), and geom_lm() combined into one call.

scatter(midwest, percbelowpoverty, percollege) +
  theme_blog()
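For readers curious how the three pieces combine, here is a minimal sketch. It is an assumption about the design, not the ladtools source: the argument order (data, y, x) is inferred from the call above and mirrors Stata's `scatter yvar xvar`, and geom_smooth() with explicit settings stands in for geom_lm() so the block runs on its own.

```r
library(ggplot2)

# Hypothetical re-implementation; the real ladtools::scatter() may differ.
scatter <- function(data, y, x, ...) {
  # {{ }} lets callers pass bare column names, as in the call above
  ggplot(data, aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    # Stand-in for geom_lm(): an OLS line without the standard-error band
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, ...)
}

p <- scatter(midwest, percbelowpoverty, percollege)
```

Since the function returns an ordinary ggplot object, themes and labels can still be added with `+`.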

### is_increasing() and is_decreasing()

These functions are wrappers meant to increase code readability.¹ They are mostly self-explanatory. The strictly argument controls whether repeated values break the ordering. It defaults to FALSE, so repeated values are allowed.

vec <- c(1, 1, 2, 3)

is_increasing(vec)
## [1] TRUE
is_increasing(vec, strictly = TRUE)
## [1] FALSE
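The footnote mentions base R's is.unsorted(), so wrappers along these lines are plausible. This is a sketch under that assumption, not the packaged code; in particular, how ladtools treats missing values is unknown to me, and here an NA in the input would propagate as NA.

```r
# Hypothetical re-implementations built on base::is.unsorted();
# the ladtools versions may handle edge cases differently.
is_increasing <- function(x, strictly = FALSE) {
  !is.unsorted(x, strictly = strictly)
}

is_decreasing <- function(x, strictly = FALSE) {
  # A vector is decreasing when its reverse is increasing
  !is.unsorted(rev(x), strictly = strictly)
}

vec <- c(1, 1, 2, 3)
is_increasing(vec)                   # TRUE
is_increasing(vec, strictly = TRUE)  # FALSE
is_decreasing(rev(vec))              # TRUE
```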

### calculate_outlier_value()

In many introductory statistics classes, students are taught that outliers are values more than 3 standard deviations from the mean, or values more than 1.5 times the interquartile range above the 75th percentile or below the 25th percentile. This test disappears quickly in most statistics curricula (for good reason), but I find it useful for understanding the tails of my data, rapidly testing for influential points, and adjusting visualizations.
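Both rules reduce to a pair of cutoffs. The helper below is a sketch of that calculation; the name outlier_bounds(), its arguments, and its return value are hypothetical and may not match calculate_outlier_value().

```r
# Hypothetical helper: returns the lower and upper cutoffs for either rule.
outlier_bounds <- function(x, method = c("iqr", "sd"), na.rm = TRUE) {
  method <- match.arg(method)
  if (method == "sd") {
    # 3 standard deviations on either side of the mean
    centre <- mean(x, na.rm = na.rm)
    spread <- 3 * stats::sd(x, na.rm = na.rm)
    c(lower = centre - spread, upper = centre + spread)
  } else {
    # 1.5 * IQR below the 25th or above the 75th percentile
    q <- stats::quantile(x, c(0.25, 0.75), na.rm = na.rm, names = FALSE)
    fence <- 1.5 * (q[2] - q[1])
    c(lower = q[1] - fence, upper = q[2] + fence)
  }
}

x <- c(1:9, 100)
bounds <- outlier_bounds(x)                              # lower = -3.5, upper = 14.5
outliers <- x[x < bounds["lower"] | x > bounds["upper"]] # 100
bounds_sd <- outlier_bounds(x, method = "sd")
```

On this toy vector the 3 standard deviation rule does not flag 100, because the extreme value inflates the standard deviation itself. That is one reason the two rules can disagree on small samples.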

Boxplots are an excellent visual tool for seeing how far outliers sit from the rest of the data. With larger $$n$$, that method begins to fail. For a quick diagnosis, trimming outliers is quite convenient.

Influential point diagnostics exist for many models as well, often involving refitting the model without the outlying point. Trimming does this for all outlying points at once, giving a first look at the impact of influential points.

Visualizations also run into problems with outliers, especially with gradient color scales. One outlier can dramatically stretch the scale, compressing the differences across most of the distribution. Filtering outliers or setting them to NA is a shortcut that sacrifices little integrity and shows the majority of the distribution properly.
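To make the color-scale point concrete, here is a small sketch on simulated data (the data frame and column names are made up for illustration). It applies the 1.5 times IQR rule inline, sets flagged values to NA, and lets the scale draw them in a neutral color.

```r
library(ggplot2)

set.seed(1)
# Simulated data: 199 ordinary values plus one extreme value of 50
df <- data.frame(x = runif(200), y = runif(200), z = c(rnorm(199), 50))

# Inline 1.5 * IQR fences
q <- quantile(df$z, c(0.25, 0.75), names = FALSE)
fence <- 1.5 * diff(q)
is_outlier <- df$z < q[1] - fence | df$z > q[2] + fence

# Outliers become NA so they no longer stretch the gradient
df$z_trimmed <- ifelse(is_outlier, NA, df$z)

p <- ggplot(df, aes(x = x, y = y, color = z_trimmed)) +
  geom_point() +
  scale_color_gradient(na.value = "grey50")
```

Mapping color to z instead of z_trimmed would spend nearly the whole gradient on the gap between the bulk of the data and the single value of 50.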

### theme_blog()

Standard ggplot2 visualizations look decent, but anyone publishing graphs for a website or organization should do better. After thousands of graphs, the gray background looks a bit dated, and who decided the standard font should be Arial?

Here’s a standard ggplot2 graph:

The theme_bw() that ships with ggplot2 cleans up the image:

Custom themes are best built on top of a prior theme:

theme_blog <- function() {
  theme_bw(base_size = 11, base_family = "Verdana") %+replace%

The %+replace% operator builds the new theme on top of theme_bw(), replacing each element named in the theme() call that follows. The font now matches the site.
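The difference between %+replace% and the more familiar + is easy to miss. As a quick illustration (separate from theme_blog() itself), + updates only the properties you name, while %+replace% swaps in the whole element, so any property you leave out is no longer set on that element.

```r
library(ggplot2)

# `+` merges: size changes, other properties of axis.text are kept
added <- theme_bw() + theme(axis.text = element_text(size = 9))

# `%+replace%` replaces: axis.text is exactly element_text(size = 9)
replaced <- theme_bw() %+replace% theme(axis.text = element_text(size = 9))

is.null(added$axis.text$colour)     # FALSE: colour carried over from theme_bw()
is.null(replaced$axis.text$colour)  # TRUE: colour was not specified, so it is unset
```

Unset properties fall back to the parent element when the plot is drawn, so a theme built with %+replace% should spell out every property it cares about.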

Next, theme() arguments specify aspects of the theme. Custom themes are intimidating at first because they are verbose and isolated: they rely on little outside ggplot2. However, a detailed custom theme requires knowing only four functions of the element_ family (element_blank(), element_rect(), element_line(), and element_text()) plus margin() to control margins.

First, I make everything behind the plot transparent:

    theme(
      # Make everything transparent
      panel.background = element_blank(),
      plot.background = element_rect(
        fill = "transparent",
        colour = NA
      ),
      legend.key = element_rect(
        fill = "transparent",
        colour = NA
      ),

Next, I eliminate tick marks because they are redundant with the grid lines that span the entire plot. Without tick marks, the labels along the axes are a

      # Eliminate tick marks
      axis.ticks = element_blank(),

Next, I center and enlarge the title and subtitle.

      # Adjust text elements
      plot.title = element_text(
        size = 16,
        face = "bold",
        hjust = .5, # center align
        vjust = 1,
        margin = margin(t = 8, b = 5)
      ),
      plot.subtitle = element_text(
        size = 12,
        margin = margin(t = 1, b = 5)
      ),
      plot.caption = element_text(
        size = 8,
        hjust = 1
      ),

Since the tick marks are gone, the variable names on the axes and the axis labels need adjustment.

      axis.title = element_text(size = 10),
      axis.text = element_text(size = 9),
      axis.text.x = element_text(
        margin = margin(t = 1, b = 5)
      ),
      axis.text.y = element_text(
        margin = margin(r = .5, l = 5)
      ),

When the legend is positioned inside the plot, I like it to have a background. I override this setting fairly often.

      # Legend settings
      legend.background = element_rect(
        fill = "light gray",
        color = "black",
        size = .3
      ),
      legend.title = element_text(size = 7),
      legend.text = element_text(
        size = 7,
        margin = margin(t = 0, b = 0)
      ),
      legend.key.size = unit(.65, "lines"),

I decided against minor grid lines because they make the plot so busy.

      # Remove minor grid lines
      panel.grid.minor = element_blank()
    )
}

And that’s it! Here is what the graph looks like with theme_blog():

ggplot(mtcars) +
  aes(x = hp, y = mpg) +
  geom_point(aes(color = factor(cyl))) +
  geom_lm() +
  labs(
    x = "Horsepower",
    y = "Miles Per Gallon",
    color = "No. of Cylinders",
    caption = "mtcars data",
    title = "A Basic Scatterplot",
    subtitle = "Greater Horsepower Corresponds with Lower Fuel Efficiency"
  ) +
  theme_blog() +
  theme(legend.position = c(.8, .8))
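Any setting in a custom theme can be overridden per plot by adding another theme() call after it, as the last line above does for the legend position. The sketch below shows the same pattern for the legend background. To stay runnable without ladtools, it uses theme_bw() plus a gray legend box as a stand-in for theme_blog(). It also accounts for ggplot2 3.5.0, which deprecated numeric values for legend.position in favor of legend.position.inside.

```r
library(ggplot2)

# theme_bw() plus a legend box stands in for theme_blog() in this sketch
base_theme <- theme_bw() +
  theme(legend.background = element_rect(fill = "lightgray", colour = "black"))

# Pick the legend-placement form that the installed ggplot2 expects
inside <- if (utils::packageVersion("ggplot2") >= "3.5.0") {
  theme(legend.position = "inside", legend.position.inside = c(.8, .8))
} else {
  theme(legend.position = c(.8, .8))
}

p <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point() +
  base_theme +
  inside +
  theme(legend.background = element_blank())  # drop the legend box for this plot
```

Later theme() calls win, so the override has to come after the base theme in the chain.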

1. In all honesty, I lost points in statistics classes using is.unsorted() because the grader did not understand what was happening. I wrote the wrappers months ago and packaged them for convenience.