Some json files are not valid UTF8 #56

Open
ablack3 opened this issue Jun 20, 2023 · 2 comments

Comments

ablack3 commented Jun 20, 2023

Are the JSON files in the PhenotypeLibrary supposed to be valid UTF-8?
If not, what encoding is used?

library(PhenotypeLibrary)

listPhenotypes()$cohortId |>
  getPlCohortDefinitionSet() |>
  dplyr::filter(!validUTF8(json)) 
#> # A tibble: 5 × 4
#>   cohortId cohortName                                              json    sql  
#>      <dbl> <chr>                                                   <chr>   <chr>
#> 1        6 [P] Fever (3Pe, 30Era)                                  "{\n\t… "CRE…
#> 2       16 [P] Exposure to SARS-Cov 2 and coronavirus (7Pe, 30Era) "{\n\t… "CRE…
#> 3       29 [W] Autoimmune condition (FP)                           "{\n\t… "CRE…
#> 4       64 [P] Flu-like symptoms (3P, 30Era)                       "{\n\t… "CRE…
#> 5       73 [W] Pregnancy (270P, 0Era)                              "{\n\t… "CRE…

Created on 2023-06-20 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/London
#>  date     2023-06-20
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package          * version date (UTC) lib source
#>  backports          1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  bit                4.0.5   2022-11-15 [1] CRAN (R 4.2.0)
#>  bit64              4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
#>  checkmate          2.2.0   2023-04-27 [1] CRAN (R 4.2.0)
#>  cli                3.6.1   2023-03-23 [1] CRAN (R 4.2.0)
#>  crayon             1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
#>  digest             0.6.31  2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr              1.1.2   2023-04-20 [1] CRAN (R 4.2.0)
#>  evaluate           0.21    2023-05-05 [1] CRAN (R 4.2.0)
#>  fansi              1.0.4   2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap            1.1.1   2023-02-24 [1] CRAN (R 4.2.0)
#>  fs                 1.6.2   2023-04-25 [1] CRAN (R 4.2.0)
#>  generics           0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  glue               1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  hms                1.1.3   2023-03-21 [1] CRAN (R 4.2.0)
#>  htmltools          0.5.5   2023-03-23 [1] CRAN (R 4.2.0)
#>  knitr              1.43    2023-05-25 [1] CRAN (R 4.2.0)
#>  lifecycle          1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
#>  magrittr           2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  PhenotypeLibrary * 3.2.0   2023-06-20 [1] local
#>  pillar             1.9.0   2023-03-22 [1] CRAN (R 4.2.0)
#>  pkgconfig          2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr              1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache            0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3        1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo               1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils            2.12.2  2022-11-11 [1] CRAN (R 4.2.0)
#>  R6                 2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr              2.1.4   2023-02-10 [1] CRAN (R 4.2.0)
#>  reprex             2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang              1.1.1   2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown          2.22    2023-06-01 [1] CRAN (R 4.2.0)
#>  rstudioapi         0.14    2022-08-22 [1] CRAN (R 4.2.0)
#>  sessioninfo        1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  styler             1.10.1  2023-06-05 [1] CRAN (R 4.2.0)
#>  tibble             3.2.1   2023-03-20 [1] CRAN (R 4.2.0)
#>  tidyselect         1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
#>  tzdb               0.4.0   2023-05-12 [1] CRAN (R 4.2.0)
#>  utf8               1.2.3   2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs              0.6.2   2023-04-19 [1] CRAN (R 4.2.0)
#>  vroom              1.6.3   2023-04-28 [1] CRAN (R 4.2.0)
#>  withr              2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun               0.39    2023-04-20 [1] CRAN (R 4.2.0)
#>  yaml               2.3.7   2023-01-23 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
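
To narrow down which characters trip `validUTF8()`, one option is to list the non-ASCII bytes in a flagged string, since pure ASCII is always valid UTF-8 and any invalid sequence must involve a byte above 0x7F. This is only a minimal sketch: `non_ascii_bytes()` is a hypothetical helper, not part of PhenotypeLibrary, and the example string is made up rather than taken from the cohort JSON.

```r
# Hypothetical helper: report the position and hex value of every
# non-ASCII byte (> 0x7F) in a single string.
non_ascii_bytes <- function(x) {
  bytes <- charToRaw(x)
  pos <- which(bytes > as.raw(0x7F))
  data.frame(position = pos, byte = as.character(bytes[pos]))
}

# Made-up example: "caf" followed by the lone byte 0xE9, which is
# "é" in Latin-1 but an incomplete sequence in UTF-8.
example <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))

validUTF8(example)
suspects <- non_ascii_bytes(example)
print(suspects)
```

Applied to the `json` column of the five flagged cohorts, the same helper would show which bytes are worth inspecting; valid multi-byte UTF-8 characters are listed too, so the output is a list of candidates rather than a verdict.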
@gowthamrao
Member

Hi @ablack3, I am not an expert on this topic, so I will need some help from you to understand.

All the library does is source the cohort JSON from a security-enabled Atlas instance (atlas-phenotype.ohdsi.org) using this script: https://github.com/OHDSI/PhenotypeLibrary/blob/main/extras/UpdatePhenotypes.R

So whatever format ROhdsiWebApi returns from that Atlas instance is what gets saved.

With this information, maybe you can help me understand what UTF-8 is.
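
As background on the question, UTF-8 is a byte encoding of text, and a string is "valid UTF-8" only if its bytes follow UTF-8's rules for multi-byte sequences. The sketch below uses made-up bytes (not data from the library) to show the same word stored in UTF-8 and in Latin-1, and how base R's `validUTF8()` and `iconv()` treat each. Latin-1 is chosen purely for illustration; the thread does not establish what encoding the affected files actually use.

```r
# "é" is two bytes in UTF-8 (0xC3 0xA9) but a single byte in Latin-1 (0xE9).
utf8_bytes   <- as.raw(c(0x63, 0x61, 0x66, 0xC3, 0xA9))  # "café" as UTF-8
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xE9))        # "café" as Latin-1

utf8_string   <- rawToChar(utf8_bytes)
latin1_string <- rawToChar(latin1_bytes)

# validUTF8() checks the raw bytes, regardless of any declared encoding.
utf8_ok   <- validUTF8(utf8_string)    # TRUE
latin1_ok <- validUTF8(latin1_string)  # FALSE: 0xE9 starts a sequence that never completes

# Converting from the true source encoding produces valid UTF-8 bytes.
fixed <- iconv(latin1_string, from = "latin1", to = "UTF-8")
fixed_ok <- validUTF8(fixed)           # TRUE
```

If a JSON string was written in a legacy single-byte encoding somewhere upstream, it would fail the check in the same way as `latin1_string` here.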

@Zhimin-arya

I encountered the same issue when running generateCohortSet across all the cohorts: the R console threw the error message "invalid UTF-8 input in readChar()". My workaround is to open each affected cohort's JSON file in a text editor and save it again with UTF-8 encoding. The cohorts that currently have the UTF-8 encoding issue are: 1029, 1215, 1229, 235, 29, 504, 633, 645. The PhenotypeLibrary version I am using is ‘3.32.0’.
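
The manual open-and-resave step could probably be scripted. The following is a rough sketch under stated assumptions, not a tested fix for the library: `reencode_to_utf8()` is a hypothetical helper, and `assumed_encoding = "latin1"` is a guess, because the real source encoding of the affected files has not been identified in this thread. A wrong guess would produce valid UTF-8 containing the wrong characters, so the result should be checked by eye.

```r
# Hypothetical helper: rewrite a file as UTF-8 if its bytes are not already
# valid UTF-8. Returns TRUE (invisibly) if the file was rewritten.
# Note: rawToChar() errors on embedded NUL bytes, which JSON should not contain.
reencode_to_utf8 <- function(path, assumed_encoding = "latin1") {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  text <- rawToChar(bytes)
  if (validUTF8(text)) {
    return(invisible(FALSE))  # already valid, leave the file untouched
  }
  converted <- iconv(text, from = assumed_encoding, to = "UTF-8")
  if (is.na(converted)) {
    stop("Could not convert ", path, " from ", assumed_encoding)
  }
  writeBin(charToRaw(converted), path)
  invisible(TRUE)
}

# Demonstration on a temporary file containing {"n":"é"} encoded as Latin-1.
demo_path <- tempfile(fileext = ".json")
writeBin(as.raw(c(0x7B, 0x22, 0x6E, 0x22, 0x3A, 0x22, 0xE9, 0x22, 0x7D)), demo_path)

changed <- reencode_to_utf8(demo_path)
rewritten <- rawToChar(readBin(demo_path, what = "raw", n = file.size(demo_path)))
still_changed <- reencode_to_utf8(demo_path)  # second pass should be a no-op
```

Working on raw bytes avoids the readChar() error mentioned above, since nothing is interpreted as UTF-8 until after the conversion.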
