Skip to contents

Analyzes a potential join between two data.frames or tibbles without performing the full join. Reports relationship type (one-to-one, one-to-many, etc.), match rates, duplicate keys, and unmatched rows. Optionally tracks a numeric statistic column through the join to quantify impact.

Usage

validate_join(x, y, by = NULL, stat = NULL, stat_x = NULL, stat_y = NULL)

# S3 method for class 'validate_join'
print(x, ...)

# S3 method for class 'validate_join'
summary(object, ...)

Arguments

x

A data.frame or tibble (left table).

y

A data.frame or tibble (right table).

by

A character vector of column names to join on. Use a named vector c("key_x" = "key_y") when column names differ between tables. Unnamed elements are used for both tables.

stat

Optional single column name (string) to track in both tables when the column name is the same. Ignored if stat_x or stat_y is provided.

stat_x

Optional column name (string) for a numeric statistic in x.

stat_y

Optional column name (string) for a numeric statistic in y.

...

Additional arguments (currently unused).

object

A validate_join object to summarize.

Value

An S3 object of class validate_join containing:

x_name, y_name

Names of the input tables from the original call

by_x, by_y

Key columns used for the join

counts

List with row counts, match rates, and overlap statistics

stat

When stat, stat_x, or stat_y is provided, a list with stat diagnostics per table. NULL when no stat is provided.

duplicates

List with duplicate key information for each table

summary_table

A data.frame summarizing the join diagnostics

relation

Character string describing the relationship

keys_only_in_x

Unmatched keys from x

keys_only_in_y

Unmatched keys from y

Examples

x <- data.frame(id = c(1L, 2L, 3L, 3L), value = c("a", "b", "c", "d"))
y <- data.frame(id = c(2L, 3L, 4L), score = c(10, 20, 30))
result <- validate_join(x, y, by = "id")
print(result)
#> 
#> ── Join Validation: x ↔ y ──────────────────────────────────────────────────────
#> Keys in x: id
#> Keys in y: id
#> 
#>   Item                                   Value
#>   ───────────────────────────────  ───────────
#>   Relationship                     many-to-one
#>   Key(s) in x   [id]                   (1 col)
#>   Key(s) in y   [id]                   (1 col)
#>   Rows in x                                  4
#>   Distinct key combos in x                   3
#>   Rows in y                                  3
#>   Distinct key combos in y                   3
#>   Overlapping distinct key combos            2
#>   Matched row pairs (cartesian)              3
#>   Match rate from x                     75.00%
#>   Match rate from y                     66.67%
#>   Rows only in x (no match in y)             1
#>   Rows only in y (no match in x)             1
#> 
#> Duplicates: x=yes y=no

# Track a stat column with different names in each table
x2 <- data.frame(id = 1:3, sales = c(100, 200, 300))
y2 <- data.frame(id = 2:4, cost = c(10, 20, 30))
validate_join(x2, y2, by = "id", stat_x = "sales", stat_y = "cost")
#> 
#> ── Join Validation: x2 ↔ y2 ────────────────────────────────────────────────────
#> Keys in x2: id
#> Keys in y2: id
#> 
#>   Item                                   Value
#>   ────────────────────────────────  ──────────
#>   Relationship                      one-to-one
#>   Key(s) in x2   [id]                  (1 col)
#>   Key(s) in y2   [id]                  (1 col)
#>   Rows in x2                                 3
#>   Distinct key combos in x2                  3
#>   Rows in y2                                 3
#>   Distinct key combos in y2                  3
#>   Overlapping distinct key combos            2
#>   Matched row pairs (cartesian)              2
#>   Match rate from x2                    66.67%
#>   Match rate from y2                    66.67%
#>   Rows only in x2 (no match in y2)           1
#>   Rows only in y2 (no match in x2)           1
#> 
#> ── Stat diagnostics ────────────────────────────────────────────────────────────
#> 
#> sales in x2:
#>  Total: 600
#>  Matched: 500 (83.33%)
#>  Unmatched: 100 (16.67%)
#> 
#> cost in y2:
#>  Total: 60
#>  Matched: 30 (50.00%)
#>  Unmatched: 30 (50.00%)
#> 
#> Duplicates: x2=no y2=no