Commit

freeze computation for new projects
hturner committed Aug 21, 2023
1 parent cd7d38c commit cefb39c
Showing 16 changed files with 118 additions and 626 deletions.
16 changes: 16 additions & 0 deletions _freeze/projects/enhance-sample/index/execute-results/html.json
@@ -0,0 +1,16 @@
{
"hash": "f90a71abe14c9bda0a27ce3c5edf253f",
"result": {
"markdown": "---\ntitle: Enhancing `sample.int` for unequal probability sampling\ndescription: \"\" # Optional short description for post on Discussions thread\nauthor: Ahmadou Dicko and Thomas Lumley\noutput: html_document\ncategories: [C, R, Wishlist, Models]\ncomments:\n giscus:\n repo: \"r-devel/r-project-sprint-2023\"\n repo-id: \"R_kgDOIhAibA\"\n category: \"Proposals\"\n category-id: \"DIC_kwDOIhAibM4CW3GY\"\n mapping: \"title\"\n reactions-enabled: true\n loading: lazy\nbibliography: ref.bib\n---\n\n\n## Problem statement\n\nThe method of unequal probability sampling without replacement, as implemented in `base::sample` (`base::sample.int`), relies on a sequential algorithm (coded in `C`). This algorithm does not respect the prescribed inclusion probabilities (as defined by the `prob` parameter). Consequently, it can produce a biased Horvitz-Thompson estimate.\n\nThe issue can affect other packages as illustrated by this [dplyr issue affecting slice_sample](https://github.com/tidyverse/dplyr/issues/6848).\n\nTo better understand the problem, consider the illustration below, following [@tille2023remarks].\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\nN <- 12\nn <- 4\np <- (1:N)/sum(1:N)\npik <- n * p\npikest_base <- vector(mode = \"numeric\", length = N)\nnsim <- 1e4\nfor (j in 1:nsim) {\n s <- sample.int(n = N, size = n, replace = FALSE, prob = p)\n pikest_base[s] <- pikest_base[s] + 1\n}\npikest_base <- pikest_base/nsim\n```\n:::\n\n\nAfter estimating the inclusion probabilities with `sample.int`, we can compare them with algorithms from the `sampling` package, as highlighted in its documentation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(sampling)\nset.seed(42)\npikest_maxent <- sapply(1:nsim, \\(i) UPmaxentropy(pik))\npikest_maxent <- rowMeans(pikest_maxent)\n\npikest_pivot <- sapply(1:nsim, \\(i) UPpivotal(pik))\npikest_pivot <- rowMeans(pikest_pivot)\n\ncbind(pik,\n pikest_base,\n pikest_maxent,\n pikest_pivot)\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n pik pikest_base pikest_maxent pikest_pivot\n [1,] 0.05128205 0.0587 0.0544 0.0511\n [2,] 0.10256410 0.1194 0.1050 0.1049\n [3,] 0.15384615 0.1699 0.1507 0.1486\n [4,] 0.20512821 0.2279 0.2008 0.2058\n [5,] 0.25641026 0.2778 0.2553 0.2538\n [6,] 0.30769231 0.3233 0.3129 0.3149\n [7,] 0.35897436 0.3627 0.3593 0.3556\n [8,] 0.41025641 0.4181 0.4110 0.4111\n [9,] 0.46153846 0.4652 0.4577 0.4607\n[10,] 0.51282051 0.4967 0.5115 0.5205\n[11,] 0.56410256 0.5189 0.5651 0.5579\n[12,] 0.61538462 0.5614 0.6163 0.6151\n```\n:::\n:::\n\n\nTo evaluate the accuracy of the sampling algorithm, we can compute the following test statistic:\n\n$$\nz_k = \\dfrac{(\\hat{\\pi_k} - \\pi_k) \\sqrt{M}}{\\sqrt{\\pi_k (1 - \\pi_k)}}\n$$\n\nwhere $M$ is the number of simulations.\n\n$z_k$ should be asymptotically normal under the null hypothesis that the sampling algorithm is correct. The implementation is as follows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest_stat <- function(est, pik)\n (est - pik) / sqrt(pik * (1 - pik)/nsim)\n```\n:::\n\n\nWe can compute this test statistics for each method and check its normality using a `qqplot`.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](index_files/figure-html/qqplot-1.png){width=672}\n:::\n:::\n\n\nThese charts show how `base::sample` deviates more from normality compared to the other two competing algorithms available in the `sampling` package.\n\n## Proposed solution\n\n- **Option 1:**\n - Refine the documentation for `base::sample` and `base::sample.int`.\n - Suggest adding a concise paragraph in the `Details` section, explaining why the function is not suitable for unequal probability sampling based on inclusion probabilities.\n- **Option 2:**\n - Introduce a new `C` function and an additional argument to `sample.int` to incorporate the new method.\n - Retain the original [`ProbSampleNoReplace` `C` function](https://github.com/wch/r-source/blob/7ad1c5790841572c88955b1da805d28174ff0b56/src/main/random.c#L406).\n 
- By default, `base::sample.int` should utilize `ProbSampleNoReplace` for unequal probability sampling without replacement. This ensures backward compatibility and prevents disruptions to existing code.\n\nIt is hard to pick one algorithm over the multitude of available unequal probability sampling algorithms. In theory, there's no best algorithm for unequal probability. However, the Maximum Entropy Sampling algorithm (`sampling::UPmaxentropy`) also known as Conditional Poisson Sampling is an algorithm with good properties (maximizing entropy) that can be used as a reference.\n\n## Project requirements\n\n- Good understanding of survey sampling algorithms.\n- Good C and R coding skills.\n\n## Project outcomes\n\nAn enhanced version of `sample` (`sample.int`) with a new argument to select the new algorithm for unequal probability sampling.\n\n## References\n\n<div id=\"refs\"></div>\n\n## Reactions and comments\n\n\n```{=html}\n<!-- \nPlease leave the Reactions and comments section\n- a Giscus comment box will be automatically added here \n-->\n```\n",
"supporting": [
"index_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
@@ -0,0 +1,16 @@
{
"hash": "9615435fd4f0df159d3381f0d8ed89f4",
"result": {
"markdown": "---\ntitle: Facilitate Translation for Mac GUI # Warning: changing the title may create a new Discussions thread!\ndescription: \"\" # Optional short description for post on Discussions\nauthor: Simon Urbanek and Michael Chirico # and co-authors if applicable\noutput: html_document\ncategories: [Translations] # use labels from https://github.com/r-devel/r-project-sprint-2023/labels\ncomments:\n giscus: \n repo: \"r-devel/r-project-sprint-2023\"\n repo-id: \"R_kgDOIhAibA\"\n category: \"Proposals\"\n category-id: \"DIC_kwDOIhAibM4CW3GY\"\n mapping: \"title\"\n reactions-enabled: true\n loading: lazy\n---\n\n```{=html}\n<!-- \nThis template is based on https://contributor.r-project.org/r-project-sprint-2023/projects/quartz-alpha-mask/.\nThe sections are provided as a guide and may not be appropriate to your proposal, feel free to skip or change sections.\n\nPlease label any R code chunks, especially those producing images.\n-->\n```\n\n## Background\n\nThe R for Mac GUI is the most comprehensive platform-specific R GUI and has been pioneering a lot of features now present in other GUIs and commercial products. Although officially supported by R Core, the Mac GUI is maintained on the [Mac GUI](https://github.com/R-macos/Mac-GUI) GitHub repository rather than the R Project Subversion repository.\n\nOne aspect of the Mac GUI that has room for improvement is the translation of GUI text into languages other than English.\n\n## Problem statement\n\nThe number of languages currently supported is rather limited (German, Italian, French, Japanese, Dutch) and existing translations may be outdated.\n\nIn addition, contributing translations is not straight-forward. The preferred workflow requires use of Xcode on macOS. An alternative is to collaborate with a maintainer who will send the strings text files to be translated. However, this workflow places extra burden on the maintainers and it is difficult for contributors to identify which strings need updating. 
Both workflows require contributors to take special care to use UTF-8 encoding to avoid corrupting translation files.\n\n## Proposed solution\n\nThis project will add the Mac GUI strings as a component on the [Weblate server for R](https://translate.rx.studio/projects/r-project/). This would mean that the technical details are taken care of and contributors can focus on translating the text, only requiring access to a browser to contribute.\n\nThe proposed work plan is as follows:\n\n1. Set up Weblate to work with the strings files from the Mac GUI (collaborating with those working on [Weblate Improvements](https://contributor.r-project.org/r-project-sprint-2023/projects/weblate-improvements/)), see the [Weblate docs](https://docs.weblate.org/en/latest/formats/apple.html#apple).\n2. Test the workflow of updating a translation with speakers of currently translated languages at the sprint, e.g. German, French or Japanese.\n3. Add new Left-to-Right languages. Potential additions represented at the sprint: Spanish, Brazilian Portuguese, Nepali.\n4. Add support for Right-to-Left languages, see the [Apple docs](https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/SupportingRight-To-LeftLanguages/SupportingRight-To-LeftLanguages.html#//apple_ref/doc/uid/10000171i-CH17-SW1).\n5. Begin to add Arabic translations.\n\nTasks 2 and 3 depend on a minimal viable product from task 1 (e.g. an initial set up may rely on some manual steps that could be scripted/automated later). Translators could contribute to other projects until this is ready.\n\nTask 4 can be worked on independently of tasks 1 to 3, but task 5 is dependent on both task 1 and 4. 
Arabic translators may wish to work on [translations for the Windows GUI](https://translate.rx.studio/projects/r-project/base-r-gui/ar/) in the meantime, which can already be done through Weblate.\n\n## Project requirements\n\n<!-- Include here prerequisite knowledge and any operating system requirements -->\n\nFluent speakers of languages other than English are required for the translation tasks (2, 3 and 5).\n\nThe Weblate set-up will need to be done by a Weblate admin (currently only Gergely Daróczi), but might be supported by a contributor with advanced knowledge of git.\n\nSupport for Right-to-Left languages requires modifying Objective C code with Xcode on macOS. This will likely be done by Simon Urbanek, but might be supported by a macOS user wanting to learn more about development of the Mac GUI.\n\n## Project resources\n\n- [Mac GUI GitHub repository](https://github.com/R-macos/Mac-GUI/)\n\n- [Weblate documentation on Apple strings](https://docs.weblate.org/en/latest/formats/apple.html#apple)\n\n- [Apple developer documentation on RTL language support](https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/SupportingRight-To-LeftLanguages/SupportingRight-To-LeftLanguages.html#//apple_ref/doc/uid/10000171i-CH17-SW1)\n\n## Project outcomes\n\nContributions to the next release of the Mac GUI.\n\n## Reactions and comments\n\n\n```{=html}\n<!-- \nPlease leave the Reactions and comments section\n- a Giscus comment box will be automatically added here \n-->\n```\n",
"supporting": [
"index_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}