Compare Two Tables

Compares two data.frames or tibbles by examining column names, row counts, key overlap, numeric discrepancies, and categorical discrepancies. Useful for validating data processing pipelines.

Usage

compare_tables(
  x,
  y,
  key_cols = NULL,
  tol = .Machine$double.eps,
  top_n = Inf,
  compare_cols = NULL,
  exclude_cols = NULL,
  on_non_unique = c("warn", "stop")
)

# S3 method for class 'compare_tbl'
print(x, show_n = 5L, ...)

# S3 method for class 'compare_tbl'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)

Arguments

x: First data.frame or tibble to compare.
y: Second data.frame or tibble to compare.
key_cols: Character vector of column names to use as keys for matching rows. If NULL (default), automatically detects character, factor, and integer columns as keys.
tol: Numeric tolerance for comparing numeric columns. Differences less than or equal to tol are considered equal. Defaults to .Machine$double.eps (machine double-precision).
top_n: Maximum number of row-level discrepancies to store per column (numeric and categorical), and maximum unmatched keys to store. Defaults to Inf (all). Unmatched keys are stored in arbitrary order.
compare_cols: Character vector of column names to compare. If NULL (default), all common non-key columns are compared. Mutually exclusive with exclude_cols.
exclude_cols: Character vector of column names to exclude from comparison. If NULL (default), no columns are excluded. Mutually exclusive with compare_cols.
on_non_unique: What to do when the chosen key_cols do not form a primary key (rows are duplicated on the key, or a key column contains NA) in x or y. "warn" (default) issues a warning and proceeds — note that comparisons will be inflated by cartesian row expansion at the merge. "stop" aborts with the same message.
show_n: Maximum number of rows to display for discrepancies and unmatched keys in the printed output. Defaults to 5L.
...: Additional arguments (currently unused).
row.names: Passed to as.data.frame(). Default NULL.
optional: Passed to as.data.frame(). Default FALSE.

Value

An S3 object of class compare_tbl containing:

name_x, name_y: Names of the compared objects
common_columns: Column names present in both tables
only_x: Column names only in x
only_y: Column names only in y
type_mismatches: Data.frame of columns with different types, or NULL
nrow_x: Number of rows in x
nrow_y: Number of rows in y
key_summary: List summarising the chosen keys and their overlap, or NULL if no keys could be determined. Fields: keys, auto (logical), x_unique, y_unique, matches, only_x, only_y, is_pk_x, is_pk_y (logical: do keys uniquely identify rows in each table), n_dup_combos_x, n_dup_combos_y (number of key combinations appearing more than once), has_na_keys_x, has_na_keys_y (NA values present in any key column).
numeric_summary: Data.frame of numeric discrepancy quantiles (with n_over_tol count), or NULL
comparison_method: How columns were compared ("keys", "row_index", or NA)
rows_matched: Number of rows matched on keys
tol: The tolerance used
top_n: The top_n used
discrepancies: Data.frame of row-level numeric discrepancies exceeding tol (or where one side is NA), with key columns (or row_index), column, value_x, value_y, abs_diff, and pct_diff (relative difference as a proportion). NULL if none.
categorical_summary: Data.frame with column, n_compared, n_mismatched, pct_mismatched (proportion, 0–1), n_na_mismatch, or NULL
categorical_discrepancies: Data.frame of row-level categorical discrepancies with key columns (or row_index), column, value_x, value_y. NULL if none.
total_discrepancies: Total number of cell-level discrepancies across all column types (not limited by top_n)
only_x_keys: Data.frame of key combinations only in x (up to top_n rows), or NULL
only_y_keys: Data.frame of key combinations only in y (up to top_n rows), or NULL
match_summary: List with only_x, only_y, matched_no_disc, matched_with_disc, pct_no_disc (proportion, 0–1), pct_with_disc (proportion, 0–1)

Use as.data.frame() to extract all discrepancies (numeric and categorical) as a single tidy data.frame.

Examples

x <- data.frame(id = 1:3, value = c(10.0, 20.0, 30.0))
y <- data.frame(id = 1:3, value = c(10.1, 20.0, 30.5))
compare_tables(x, y)
#> 
#> ── Table Comparison: x vs y ────────────────────────────────────────────────────
#> 1. Row counts
#> x: 3 rows
#> y: 3 rows
#> Difference: =
#> 
#> 2. Column names
#> Matching columns: 2
#> Only in x: 0
#> Only in y: 0
#> Type mismatches: 0
#> 
#> 3. Key columns
#> Key columns: id (auto-detected)
#> Distinct combos in x: 3
#> Distinct combos in y: 3
#> Primary key in x: yes
#> Primary key in y: yes
#> 
#> 4. Row matching
#> Only in x: 0
#> Only in y: 0
#> Matched, no discrepancies: 1 (33.3%)
#> Matched, with discrepancies: 2 (66.7%)
#> Total cell discrepancies: 2 (2 numeric, 0 categorical)
#> 
#> 5. Numeric discrepancies (absolute differences)
#> Compared after merging on keys.
#> Rows matched: 3
#> 
#>     Column  N  Min   Q25  Median  Q75  Max  >tol
#>     ──────  ─  ───  ────  ──────  ───  ───  ────
#>     value   3    0  0.05     0.1  0.3  0.5     2
#> 
#> Top discrepancies:
#>     id  column  value_x  value_y  abs_diff  pct_diff
#>     ──  ──────  ───────  ───────  ────────  ────────
#>     3   value        30     30.5       0.5      1.6%
#>     1   value        10     10.1       0.1      1.0%
#> 
#> 6. Categorical discrepancies
#> No categorical discrepancies found.

# With tolerance — differences <= 0.15 are considered equal
compare_tables(x, y, tol = 0.15)
#> 
#> ── Table Comparison: x vs y ────────────────────────────────────────────────────
#> 1. Row counts
#> x: 3 rows
#> y: 3 rows
#> Difference: =
#> 
#> 2. Column names
#> Matching columns: 2
#> Only in x: 0
#> Only in y: 0
#> Type mismatches: 0
#> 
#> 3. Key columns
#> Key columns: id (auto-detected)
#> Distinct combos in x: 3
#> Distinct combos in y: 3
#> Primary key in x: yes
#> Primary key in y: yes
#> 
#> 4. Row matching (tol = 0.15)
#> Only in x: 0
#> Only in y: 0
#> Matched, no discrepancies: 2 (66.7%)
#> Matched, with discrepancies: 1 (33.3%)
#> Total cell discrepancies: 1 (1 numeric, 0 categorical)
#> 
#> 5. Numeric discrepancies (absolute differences)
#> Compared after merging on keys.
#> Rows matched: 3
#> 
#>     Column  N  Min   Q25  Median  Q75  Max  >tol
#>     ──────  ─  ───  ────  ──────  ───  ───  ────
#>     value   3    0  0.05     0.1  0.3  0.5     1
#> 
#> Top discrepancies:
#>     id  column  value_x  value_y  abs_diff  pct_diff
#>     ──  ──────  ───────  ───────  ────────  ────────
#>     3   value        30     30.5       0.5      1.6%
#> 
#> 6. Categorical discrepancies
#> No categorical discrepancies found.

# Categorical columns are also compared
a <- data.frame(id = 1:3, status = c("ok", "warn", "fail"),
                 stringsAsFactors = FALSE)
b <- data.frame(id = 1:3, status = c("ok", "warn", "error"),
                 stringsAsFactors = FALSE)
compare_tables(a, b)
#> 
#> ── Table Comparison: a vs b ────────────────────────────────────────────────────
#> 1. Row counts
#> a: 3 rows
#> b: 3 rows
#> Difference: =
#> 
#> 2. Column names
#> Matching columns: 2
#> Only in a: 0
#> Only in b: 0
#> Type mismatches: 0
#> 
#> 3. Key columns
#> Key columns: id and status (auto-detected)
#> Distinct combos in a: 3
#> Distinct combos in b: 3
#> Primary key in a: yes
#> Primary key in b: yes
#> 
#> 4. Row matching
#> Only in a: 1
#> Only in b: 1
#> Matched, no discrepancies: 2 (100%)
#> Matched, with discrepancies: 0 (0%)
#> 
#> Unmatched keys in a:
#>     id  status
#>     ──  ──────
#>     3   fail  
#> 
#> Unmatched keys in b:
#>     id  status
#>     ──  ──────
#>     3   error 
#> 
#> 5. Numeric discrepancies (absolute differences)
#> No common numeric columns found.
#> 
#> 6. Categorical discrepancies
#> No value columns compared.

Usage

Arguments

Value

See also

Examples