In this blog post, we will discuss how to use the data.table
package in R to fill NA with the previous value. This is a common task when cleaning datasets with missing values, especially in cases where the missing values can be logically filled with the most recent valid entry. Previously, we have learned how to find the first non-NA value in a data.table, but know we will learn how to take the previous value and fill this missing value.
The data.table
package is designed to handle large datasets and offers concise syntax for data manipulation tasks. One of these tasks is replacing missing values (NAs) with the last observed non-missing value, a technique known as last observation carried forward (LOCF). This approach is often used in time-series data or with grouped observations.
Table of Contents
- How to Fill NA with the Previous Value Using data.table
- Comparing Performance: data.table vs dplyr
- Summary
- Resources
How to Fill NA with the Previous Value Using data.table
Here is an example of how to fill NA values in data.table
:
# Fill NA with previous value within each group
dt[, value := nafill(value, type = "locf"), by = id]
Code language: CSS (css)
In the code example above, we use the nafill()
function with the type = "locf"
argument to replace missing values with the last observed value within each group. The by = id
part ensures the operation is applied to each group separately.
You can also use "nocb"
to fill NAs with the next valid value, depending on the needs of your dataset.
Comparing Performance: data.table vs dplyr
Another popular R package for manipulating data is dplyr
. However, compared to data.table
, dplyr
is generally less efficient in terms of memory usage and speed when handling large datasets. Let us compare the two methods to fill NA values with the previous value and observe the performance difference.
library(data.table)
library(dplyr)
library(zoo)
library(microbenchmark)
# Set a seed for reproducibility
set.seed(123)
# Create a large dataset with 1 million rows and 5 columns
n <- 1e6
dt <- data.table(id = rep(1:100, each = n/100),
value1 = rnorm(n),
value2 = rnorm(n),
value3 = rnorm(n),
value4 = rnorm(n),
value5 = rnorm(n))
# Introduce some NAs in the 'value' columns
for(i in 2:5) {
dt[[paste0("value", i)]] <- ifelse(runif(n) < 0.1, NA, dt[[paste0("value", i)]])
}
# Perform microbenchmarking for both methods
benchmark_result <- microbenchmark(
data_table_method = dt[, (paste0("value", 1:5)) := lapply(.SD, zoo::na.locf), .SDcols = paste0("value", 1:5)],
dplyr_method = dt %>%
group_by(id) %>%
mutate(across(starts_with("value"), zoo::na.locf, na.rm = FALSE)),
times = 10
)
# Print the benchmark results
print(benchmark_result)
Code language: R (r)

The results show that the data.table_method
performs significantly faster than the dplyr_method
. The minimum time for data.table_method
is around 353 milliseconds, whereas dplyr_method
takes approximately 1060 milliseconds, more than three times longer. The median time for data.table_method
is 464.7 milliseconds, while dplyr_method
reaches 1165.9 milliseconds. These results show us that data.table
is more efficient in filling NA
values when working with large datasets.
Summary
In this post, we demonstrated how to use data.table
NA values to fill the previous value. This technique is valuable in various data cleaning scenarios, especially when working with time-series or grouped data.
If you found this post helpful, feel free to share it or leave a comment below. We would appreciate hearing about your experiences with data cleaning in R.
Resources
Here are some more helpful data.table tutorials:
- How to Filter in data.table in R
- data.table Count Rows by Group
- How to Select Columns from data.table in R
- How to Sum Multiple Columns in data.table