Audits a character vector for common data quality issues including missing values, empty strings, whitespace problems, non-ASCII characters, and case inconsistencies. Requires the stringi package (in Suggests).
Usage
diagnose_strings(x, name = NULL)
# S3 method for class 'diagnose_strings'
print(x, ...)
Value
An S3 object of class diagnose_strings containing:
- name
Name of the variable
- n_total
Total number of elements
- n_na
Count of NA values
- n_empty
Count of empty strings
- n_whitespace_only
Count of whitespace-only strings
- n_leading_ws
Count of strings with leading whitespace
- n_trailing_ws
Count of strings with trailing whitespace
- n_non_ascii
Count of strings with non-ASCII characters
- n_case_variants
Number of unique values that differ from another value only by case
- n_case_variant_groups
Number of groups of case-insensitive duplicates
- case_variant_examples
Data frame with examples of case variants
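To make the meaning of the count fields concrete, the sketch below approximates them with base R only. It is not the package's implementation: `sketch_string_counts()` is a hypothetical helper, and the exact rules are assumptions (for example, whether whitespace-only strings also count as having leading whitespace, or whether values are trimmed before case comparison).

```r
# Hypothetical helper, NOT part of the package: approximates the documented
# count fields with base R so the definitions are easier to follow.
sketch_string_counts <- function(x) {
  stopifnot(is.character(x))
  non_na <- x[!is.na(x)]

  # Assumption: a string is "non-ASCII" if any code point exceeds 127.
  has_non_ascii <- vapply(
    non_na,
    function(s) isTRUE(any(utf8ToInt(s) > 127L)),
    logical(1),
    USE.NAMES = FALSE
  )

  # Assumption: case variants are distinct non-empty values that share the
  # same lower-cased form (no trimming is applied first).
  vals <- unique(non_na[nzchar(non_na)])
  groups <- split(vals, tolower(vals))
  variant_groups <- groups[lengths(groups) > 1L]

  list(
    n_total               = length(x),
    n_na                  = sum(is.na(x)),
    n_empty               = sum(!nzchar(non_na)),
    n_whitespace_only     = sum(grepl("^[[:space:]]+$", non_na)),
    n_leading_ws          = sum(grepl("^[[:space:]]", non_na)),
    n_trailing_ws         = sum(grepl("[[:space:]]$", non_na)),
    n_non_ascii           = sum(has_non_ascii),
    n_case_variants       = sum(lengths(variant_groups)),
    n_case_variant_groups = length(variant_groups)
  )
}

firms <- c("Apple", "APPLE", "apple", " Microsoft ", "Google", NA, "")
str(sketch_string_counts(firms))
```

For the `firms` vector used in the Examples section, this sketch gives one NA, one empty string, one value each with leading and trailing whitespace, and one case-variant group with three variants. Results on other inputs may differ from `diagnose_strings()` wherever the assumptions above do not hold.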
See also
Other data quality:
audit_transform(),
diagnose_nas(),
get_summary_table(),
summarize_column(),
tab()
Examples
firms <- c("Apple", "APPLE", "apple", " Microsoft ", "Google", NA, "")
diagnose_strings(firms)
#>
#> ── String Column Diagnosis: firms ──────────────────────────────────────────────
#> Total elements: 7
#>
#> Missing & Empty:
#> • NA values: 1 (14.3%)
#> • Empty strings: 1 (14.3%)
#> • Whitespace-only: 0 (0.0%)
#>
#> Whitespace Issues:
#> • Leading whitespace: 1
#> • Trailing whitespace: 1
#>
#> Encoding:
#> • Non-ASCII chars: 0
#>
#> Case Inconsistencies:
#> • Variant groups: 1
#> • Total variants: 3
#>
#> Case variant examples (up to 5 groups):
#> lower n_variants examples
#> apple 3 Apple, APPLE, apple