Skip to contents

Audits a character vector for common data quality issues including missing values, empty strings, whitespace problems, non-ASCII characters, and case inconsistencies. Requires the stringi package (in Suggests).

Usage

diagnose_strings(x, name = NULL)

# S3 method for class 'diagnose_strings'
print(x, ...)

Arguments

x

Character vector to diagnose.

name

Optional name for the variable (used in output). If NULL, captures the variable name from the call.

...

Additional arguments (currently unused).

Value

An S3 object of class diagnose_strings containing:

name

Name of the variable

n_total

Total number of elements

n_na

Count of NA values

n_empty

Count of empty strings

n_whitespace_only

Count of whitespace-only strings

n_leading_ws

Count of strings with leading whitespace

n_trailing_ws

Count of strings with trailing whitespace

n_non_ascii

Count of strings with non-ASCII characters

n_case_variants

Number of unique values with case variants

n_case_variant_groups

Number of groups of case-insensitive duplicates

case_variant_examples

Data.frame with examples of case variants

Examples

firms <- c("Apple", "APPLE", "apple", "  Microsoft ", "Google", NA, "")
diagnose_strings(firms)
#> 
#> ── String Column Diagnosis: firms ──────────────────────────────────────────────
#> Total elements: 7
#> 
#> Missing & Empty:
#>  NA values: 1 (14.3%)
#>  Empty strings: 1 (14.3%)
#>  Whitespace-only: 0 (0.0%)
#> 
#> Whitespace Issues:
#>  Leading whitespace: 1
#>  Trailing whitespace: 1
#> 
#> Encoding:
#>  Non-ASCII chars: 0
#> 
#> Case Inconsistencies:
#>  Variant groups: 1
#>  Total variants: 3
#> 
#> Case variant examples (up to 5 groups):
#>  lower n_variants            examples
#>  apple          3 Apple, APPLE, apple