Analyzes a potential join between two data.frames or tibbles without performing the full join. Reports relationship type (one-to-one, one-to-many, etc.), match rates, duplicate keys, and unmatched rows. Optionally tracks a numeric statistic column through the join to quantify impact.
Arguments
- x
A data.frame or tibble (left table).
- y
A data.frame or tibble (right table).
- by
A character vector of column names to join on. Use a named vector
c("key_x" = "key_y")when column names differ between tables. Unnamed elements are used for both tables.- stat
Optional single column name (string) to track in both tables when the column name is the same. Ignored if
stat_xorstat_yis provided.- stat_x
Optional column name (string) for a numeric statistic in
x.- stat_y
Optional column name (string) for a numeric statistic in
y.- ...
Additional arguments (currently unused).
- object
A
validate_joinobject to summarize.
Value
An S3 object of class validate_join containing:
- x_name, y_name
Names of the input tables from the original call
- by_x, by_y
Key columns used for the join
- counts
List with row counts, match rates, and overlap statistics
- stat
When
stat,stat_x, orstat_yis provided, a list with stat diagnostics per table.NULLwhen no stat is provided.- duplicates
List with duplicate key information for each table
- summary_table
A data.frame summarizing the join diagnostics
- relation
Character string describing the relationship
- keys_only_in_x
Unmatched keys from x
- keys_only_in_y
Unmatched keys from y
See also
Other join validation:
compare_tables(),
validate_primary_keys(),
validate_var_relationship()
Examples
x <- data.frame(id = c(1L, 2L, 3L, 3L), value = c("a", "b", "c", "d"))
y <- data.frame(id = c(2L, 3L, 4L), score = c(10, 20, 30))
result <- validate_join(x, y, by = "id")
print(result)
#>
#> ── Join Validation: x ↔ y ──────────────────────────────────────────────────────
#> Keys in x: id
#> Keys in y: id
#>
#> Item Value
#> ─────────────────────────────── ───────────
#> Relationship many-to-one
#> Key(s) in x [id] (1 col)
#> Key(s) in y [id] (1 col)
#> Rows in x 4
#> Distinct key combos in x 3
#> Rows in y 3
#> Distinct key combos in y 3
#> Overlapping distinct key combos 2
#> Matched row pairs (cartesian) 3
#> Match rate from x 75.00%
#> Match rate from y 66.67%
#> Rows only in x (no match in y) 1
#> Rows only in y (no match in x) 1
#>
#> Duplicates: x=yes y=no
# Track a stat column with different names in each table
x2 <- data.frame(id = 1:3, sales = c(100, 200, 300))
y2 <- data.frame(id = 2:4, cost = c(10, 20, 30))
validate_join(x2, y2, by = "id", stat_x = "sales", stat_y = "cost")
#>
#> ── Join Validation: x2 ↔ y2 ────────────────────────────────────────────────────
#> Keys in x2: id
#> Keys in y2: id
#>
#> Item Value
#> ──────────────────────────────── ──────────
#> Relationship one-to-one
#> Key(s) in x2 [id] (1 col)
#> Key(s) in y2 [id] (1 col)
#> Rows in x2 3
#> Distinct key combos in x2 3
#> Rows in y2 3
#> Distinct key combos in y2 3
#> Overlapping distinct key combos 2
#> Matched row pairs (cartesian) 2
#> Match rate from x2 66.67%
#> Match rate from y2 66.67%
#> Rows only in x2 (no match in y2) 1
#> Rows only in y2 (no match in x2) 1
#>
#> ── Stat diagnostics ────────────────────────────────────────────────────────────
#>
#> sales in x2:
#> • Total: 600
#> • Matched: 500 (83.33%)
#> • Unmatched: 100 (16.67%)
#>
#> cost in y2:
#> • Total: 60
#> • Matched: 30 (50.00%)
#> • Unmatched: 30 (50.00%)
#>
#> Duplicates: x2=no y2=no