Issue with class2tree on taxa list with multiple classification levels #934

Closed
meghanmshea opened this issue Aug 23, 2024 · 1 comment

@meghanmshea

I am combining data across multiple biodiversity survey types, which results in a taxonomic list that has multiple final classification levels (e.g. some organisms are identified to species, some only to genus, etc.). I use the classification function and GBIF database to get standardized taxonomic classification, and then I want to plot a tree to visualize all of the organisms identified (and ultimately, mark which organisms were identified by which survey type).

When I use class2tree, however, the hierarchy seems to get funky: it seems to treat every input classification as the same taxonomic level, so the resulting tree doesn't make taxonomic sense.

Here's a small subset example:

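library(taxize)

# Look up the GBIF classification for each taxon key (species, genus, and class level)
taxa_classification_test <- classification(c(2293068, 2293077, 2293078, 5166486, 2292406, 206, 225), db = "gbif")
taxa_tree_test <- class2tree(unique(taxa_classification_test), check = TRUE)

plot(taxa_tree_test)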

The resulting plot, for example, treats species and higher level taxonomic classifications the same, which is especially apparent with Haliotis, since the classification list had one genus-level classification, which should be the node above the two species-level classifications -- but here they are all side-by-side.

Screenshot 2024-08-23 at 2 07 09 PM

Looking at the underlying taxa_tree_test, it seems like one issue might be in how taxa_tree_test$classification is being constructed:
Screenshot 2024-08-23 at 2 06 03 PM

As shown, all of the inputs are getting added as "species", even when they are different taxonomic levels.

Is there any way to fix this?

Session Info
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.0.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rotl_3.1.0         plotly_4.10.4      ggtreeExtra_1.14.0 ggtree_3.12.0      lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1     
 [8] dplyr_1.1.4        purrr_1.0.2        readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0   
[15] taxize_0.9.100     rgbif_3.8.0       

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1    viridisLite_0.4.2   farver_2.1.2        urltools_1.7.3      fastmap_1.2.0       lazyeval_0.2.2      XML_3.99-0.17      
 [8] digest_0.6.37       timechange_0.3.0    lifecycle_1.0.4     tidytree_0.4.6      magrittr_2.0.3      compiler_4.4.1      progress_1.2.3     
[15] rlang_1.1.4         tools_4.4.1         yaml_2.3.10         igraph_2.0.3        utf8_1.2.4          data.table_1.15.4   phangorn_2.11.1    
[22] conditionz_0.1.0    prettyunits_1.2.0   htmlwidgets_1.6.4   labeling_0.4.3      curl_5.2.1          plyr_1.8.9          xml2_1.3.6         
[29] aplot_0.2.3         httpcode_0.3.0      withr_3.0.1         triebeard_0.4.1     grid_4.4.1          fansi_1.0.6         colorspace_2.1-1   
[36] scales_1.3.0        iterators_1.0.14    crul_1.5.0          cli_3.6.3           crayon_1.5.3        treeio_1.28.0       generics_0.1.3     
[43] rstudioapi_0.16.0   httr_1.4.7          tzdb_0.4.0          ape_5.8             parallel_4.4.1      ggplotify_0.1.2     BiocManager_1.30.24
[50] vctrs_0.6.5         yulab.utils_0.1.6   Matrix_1.7-0        jsonlite_1.8.8      gridGraphics_0.5-1  hms_1.1.3           patchwork_1.2.0    
[57] crosstalk_1.2.1     foreach_1.5.2       ggnewscale_0.5.0    glue_1.7.0          codetools_0.2-20    stringi_1.8.4       gtable_0.3.5       
[64] quadprog_1.5-8      munsell_0.5.1       pillar_1.9.0        rappdirs_0.3.3      htmltools_0.5.8.1   R6_2.5.1            httr2_1.0.2        
[71] bold_1.3.0          oai_0.4.0           lattice_0.22-6      rentrez_1.2.3       ggfun_0.1.5         rncl_0.8.7          Rcpp_1.0.13        
[78] uuid_1.2-1          fastmatch_1.1-4     nlme_3.1-166        whisker_0.4.1       fs_1.6.4            zoo_1.8-12          pkgconfig_2.0.3  

@trvinh
Contributor

trvinh commented Aug 24, 2024

Hi @meghanmshea ,

First, the function is implemented so that all input taxa are displayed as tips of the tree. That is why Haliotis is plotted at the same level as H. rufescens and H. cracherodii. Second, the classification table is used to calculate the distance matrix between the input taxa. All NA values in the table are replaced by the previous known value, an approach that has been shown to resolve the taxonomy tree more effectively than simply ignoring the NA values. Because of this, the value of the first rank (species in this case) cannot be NA. The complete classification table actually looks like this:

                  fullName                  species       genus     family        order       class     phylum  kingdom
2                 Haliotis                 Haliotis    Haliotis Haliotidae  Lepetellida  Gastropoda   Mollusca Animalia
3     Haliotis cracherodii     Haliotis cracherodii    Haliotis Haliotidae  Lepetellida  Gastropoda   Mollusca Animalia
4       Haliotis rufescens       Haliotis rufescens    Haliotis Haliotidae  Lepetellida  Gastropoda   Mollusca Animalia
5 Megabalanus californicus Megabalanus californicus Megabalanus  Balanidae     Sessilia Maxillopoda Arthropoda Animalia
6   Cadlina luteomarginata   Cadlina luteomarginata     Cadlina Cadlinidae Nudibranchia  Gastropoda   Mollusca Animalia
7                 Anthozoa                 Anthozoa    Anthozoa   Anthozoa     Anthozoa    Anthozoa   Cnidaria Animalia
8               Gastropoda               Gastropoda  Gastropoda Gastropoda   Gastropoda  Gastropoda   Mollusca Animalia

When the table is returned as output, duplicated values are replaced by NA, which can make the result confusing or even wrong (for example, Anthozoa is a class, not a species).
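
To make the fill step described above concrete, here is a minimal base-R sketch. It is not the code taxize uses internally: `fill_ranks()` and the small table are hypothetical, and the sketch assumes that "previous known value" means the nearest known higher rank, which is consistent with the Anthozoa and Gastropoda rows in the table.

```r
# Hypothetical helper, not part of taxize: fill each missing rank with the
# nearest known higher rank, working from kingdom down to species.
fill_ranks <- function(df, ranks) {
  # `ranks` must be ordered from lowest (species) to highest (kingdom)
  for (i in rev(seq_along(ranks))[-1]) {
    lower <- ranks[i]
    higher <- ranks[i + 1]
    missing <- is.na(df[[lower]])
    df[[lower]][missing] <- df[[higher]][missing]
  }
  df
}

ranks <- c("species", "genus", "family", "order", "class", "phylum", "kingdom")

# Illustrative rows: one species, one genus-level input, one class-level input
tab <- data.frame(
  species = c("Haliotis rufescens", NA, NA),
  genus   = c("Haliotis", "Haliotis", NA),
  family  = c("Haliotidae", "Haliotidae", NA),
  order   = c("Lepetellida", "Lepetellida", NA),
  class   = c("Gastropoda", "Gastropoda", "Anthozoa"),
  phylum  = c("Mollusca", "Mollusca", "Cnidaria"),
  kingdom = c("Animalia", "Animalia", "Animalia"),
  stringsAsFactors = FALSE
)

filled <- fill_ranks(tab, ranks)
print(filled)
```

After filling, the genus-level and class-level inputs carry their own name in the species column, which is why they end up as tips next to true species.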

So a simple solution is to avoid passing a higher-level taxon as input when it already belongs to the taxonomy string of another input taxon (e.g. Haliotis belongs to the taxonomy string of H. cracherodii, and Gastropoda belongs to the taxonomy string of C. luteomarginata). A smarter solution would be for the function to remove those taxa automatically :-D

The removed taxa will be shown as node labels anyway:

[Image: tree plot with the removed taxa shown as node labels]
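
The workaround suggested above (dropping inputs that already appear in another input's lineage) could be sketched as follows. `drop_ancestor_taxa()` is a hypothetical helper, not a taxize function, and the mock list only assumes the shape that `classification()` returns: one data frame per query with a `name` column ordered from kingdom down to the queried taxon.

```r
# Hypothetical helper, not part of taxize: drop every input taxon that also
# occurs as an ancestor in the lineage of another input taxon.
drop_ancestor_taxa <- function(cls) {
  # The last row of each data frame is the queried taxon itself
  tip_names <- vapply(cls, function(x) x$name[nrow(x)], character(1))
  # Every name that sits above the tip in any lineage
  ancestors <- unique(unlist(lapply(cls, function(x) x$name[-nrow(x)])))
  out <- cls[!tip_names %in% ancestors]
  # Plain list subsetting drops the S3 class, so copy it back if there is one
  oldClass(out) <- oldClass(cls)
  out
}

# Mock lineages standing in for classification() output (no network needed)
mk <- function(name, rank) data.frame(name = name, rank = rank, stringsAsFactors = FALSE)
cls <- list(
  "Haliotis" = mk(
    c("Animalia", "Mollusca", "Gastropoda", "Lepetellida", "Haliotidae", "Haliotis"),
    c("kingdom", "phylum", "class", "order", "family", "genus")
  ),
  "Haliotis rufescens" = mk(
    c("Animalia", "Mollusca", "Gastropoda", "Lepetellida", "Haliotidae", "Haliotis",
      "Haliotis rufescens"),
    c("kingdom", "phylum", "class", "order", "family", "genus", "species")
  ),
  "Anthozoa" = mk(c("Animalia", "Cnidaria", "Anthozoa"), c("kingdom", "phylum", "class")),
  "Gastropoda" = mk(c("Animalia", "Mollusca", "Gastropoda"), c("kingdom", "phylum", "class"))
)

kept <- drop_ancestor_taxa(cls)
print(names(kept))
```

Here Haliotis and Gastropoda are dropped because both appear in the lineage of Haliotis rufescens, while Anthozoa stays because no other input descends from it. With real data, the filtered list would then be passed to class2tree() in place of the full one.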

I hope this helps!

Best,
Vinh

trvinh added a commit to trvinh/taxize that referenced this issue Aug 26, 2024
zachary-foster added a commit that referenced this issue Sep 4, 2024
class2tree checks duplicated taxa in higher levels, resolve #934