Prepare for a new version #145
base: master
Changes from all commits
8835c68
56d065c
4aac673
4bdf2a2
1513803
ec63f2a
85dae81
9e59d00
05f2748
30ad0b0
@@ -5,15 +5,21 @@
 The RDatasets package provides an easy way for Julia users to experiment with most of the standard data sets that are available in the core of R as well as datasets included with many of R's most popular packages. This package is essentially a simplistic port of the Rdatasets repo created by Vincent Arelbundock, who conveniently gathered data sets from many of the standard R packages in one convenient location on GitHub at https://github.com/vincentarelbundock/Rdatasets
 
 In order to load one of the data sets included in the RDatasets package, you will need to have the `DataFrames` package installed. This package is automatically installed as a dependency of the `RDatasets` package if you install `RDatasets` as follows:
-
-    Pkg.add("RDatasets")
-
+```julia
+Pkg.add("RDatasets")
+```
 After installing the RDatasets package, you can then load data sets using the `dataset()` function, which takes the name of a package and a data set as arguments:
-
-    using RDatasets
-    iris = dataset("datasets", "iris")
-    neuro = dataset("boot", "neuro")
-
+```julia
+using RDatasets
+iris = dataset("datasets", "iris")
+neuro = dataset("boot", "neuro")
+```
+You can also get descriptions of the datasets by calling `RDatasets.description`:
+```julia
+RDatasets.description("datasets", "iris")
+# or
+RDatasets.description(iris) # only use this on DataFrames returned from `dataset`!
+```
Review comment on lines +21 to +22: suggested change
 # Data Sets
 
 The `RDatasets.packages()` function returns a table of represented R packages:
@@ -74,6 +80,23 @@ mlmRev|guImmun|Immunization in Guatemala|2159|13
 mlmRev|guPrenat|Prenatal care in Guatemala|2449|15
 mlmRev|star|Student Teacher Achievement Ratio (STAR) project data|26796|18
 
+# How to add datasets from a new package
+
+**Step 1: add the data from the package**
+
+1. In your clone of this repo `mkdir -p data/$PKG`
+2. Go to CRAN
+3. Download the *source package*
+4. Extract one or more of the datasets in the `data` directory into the new directory
+
+**Step 2: add the metadata**
+
+Run the script:
+
+    $ scripts/update_doc_one.sh $PKG
+
+Now it's ready for you to submit your pull request.
+
 # Licensing and Intellectual Property
 
 Following Vincent's lead, we have assumed that all of the data sets in this repository can be made available under the GPL-3 license. If you know that one of the datasets released here should not be released publicly or if you know that a data set can only be released under a different license, please contact me so that I can remove the data set from this repository.
@@ -0,0 +1,4 @@
+R --no-save <<END
+source("src/update_doc.r")
+update_docs(".")
+END
@@ -0,0 +1,4 @@
+R --no-save <<END
+source("src/update_doc.r")
+update_package_doc(".", "$1")
+END
@@ -6,19 +6,151 @@ const Dataset_typedetect_rows = Dict{Tuple{String, String}, Union{Vector,Dict}}(
 
 function dataset(package_name::AbstractString, dataset_name::AbstractString)
     basename = joinpath(@__DIR__, "..", "data", package_name)
-
+    # First, identify possible files
+    rdataname = joinpath(basename, string(dataset_name, ".RData"))
     rdaname = joinpath(basename, string(dataset_name, ".rda"))
-    if isfile(rdaname)
-        return load(rdaname)[dataset_name]
-    end
-
     csvname = joinpath(basename, string(dataset_name, ".csv.gz"))
-    if isfile(csvname)
-        return open(csvname,"r") do io
+    # Then, check to see which exists. If none exist, error.
+    dataset = if isfile(rdataname)
+        load(rdataname)[dataset_name]
+    elseif isfile(rdaname)
+        load(rdaname)[dataset_name]
+    elseif isfile(csvname)
+        open(csvname,"r") do io
             uncompressed = IOBuffer(read(GzipDecompressorStream(io)))
             DataFrame(CSV.File(uncompressed, delim=',', quotechar='\"', missingstring="NA",
                     types=get(Dataset_typedetect_rows, (package_name, dataset_name), nothing)) )
         end
+    else
+        error("Unable to locate dataset file $rdaname or $csvname")
+    end
+    # Finally, inject metadata into the dataframe to indicate origin:
+    DataFrames.metadata!(dataset, "RDatasets.jl", (string(package_name), string(dataset_name)))
Review comment (with suggested change): Not needed AFAICT:
+    return dataset
+end
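A minimal sketch of the lookup order in `dataset` above: the first candidate file that exists wins, and an error is raised otherwise. This is an illustration under assumptions, not the PR's code. `first_existing` and its `extensions` argument are hypothetical names, and the helper only resolves a path without loading anything.

```julia
# Hypothetical helper (not part of RDatasets.jl): return the first candidate
# path that exists on disk, trying the extensions in priority order.
function first_existing(dir::AbstractString, name::AbstractString,
                        extensions = (".RData", ".rda", ".csv.gz"))
    candidates = [joinpath(dir, string(name, ext)) for ext in extensions]
    idx = findfirst(isfile, candidates)
    # Report every path that was tried, so a missing file is easy to diagnose.
    idx === nothing && error("Unable to locate dataset file; tried: ", join(candidates, ", "))
    return candidates[idx]
end
```

Keeping the candidates in one ordered collection means that adding another file format only requires one more entry in `extensions`.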
+
+
+"""
+    RDatasets.description(package_name::AbstractString, dataset_name::AbstractString)
+    RDatasets.description(df::DataFrame) # only call this on dataframes from RDatasets!
Review comment (with suggested change): Put this information in the docstring body instead. Also say what happens if that's not the case.
+
+Returns an `RDatasetDescription` object containing the description of the dataset.
Review comment: suggested change
+
+Invoke this function in exactly the same way you would invoke `dataset` to get the dataset itself.
+
+This object prints well in the REPL, and can also be shown as markdown or HTML.
Review comment: suggested change
+
+!!! note Unexported
+    This function is left deliberately unexported, since the name is pretty common.
Review comment on lines +42 to +44 (with suggested change): This isn't a standard pattern AFAIK. Better mark the function as public via
+"""
+function description(package_name::AbstractString, dataset_name::AbstractString)
+    doc_html_file = joinpath(@__DIR__, "..", "doc", package_name, "$dataset_name.html")
+    if isfile(doc_html_file)
+        return RDatasetDescription(read(doc_html_file, String))
+    else
+        return RDatasetDescription("No description available.")
Review comment: I'd distinguish two cases:
+    end
+end
+
+# This is a convenience function to get the description of a dataset from a DataFrame.
+# Since we set metadata on the DataFrame, we can use this to get the description,
+# if it exists.
+function description(df::AbstractDataFrame)
+    if "RDatasets.jl" in DataFrames.metadatakeys(df)
+        package_name, dataset_name = DataFrames.metadata(df, "RDatasets.jl")
Review comment on lines +59 to +60: suggested change
+        return description(package_name, dataset_name)
+    else
+        @warn "No metadata indicating dataset origin found. Returning default description."
Review comment: Throwing a warning is never a good pattern IMO as they can't be turned off easily. Either this is a problem and we should throw an error, or it's OK for users to rely on this and it should succeed silently. In this case I'd throw an error, possibly with an argument like
+        return RDatasetDescription("No description available.")
+    end
+end
+
+"""
+    RDatasetDescription(content::String)
+
+A type to hold the content of a dataset description.
+
+The main purpose of its existence is to provide a way to display the content
+differently in HTML and markdown contexts.
Review comment: suggested change
+
+Invoked through [`RDatasets.description`](@ref).
Review comment: suggested change
+"""
+struct RDatasetDescription
+    content::String
+end
+
+function Base.show(io::IO, mime::MIME"text/plain", d::RDatasetDescription)
+    s = description_to_markdown(d.content)
+    # Here, we show a Markdown.jl object, which the REPL can render correctly
+    # as markdown, as it does in help-mode.
+    show(io, mime, Markdown.parse(s))
+end
+function Base.show(io::IO, mime::MIME"text/markdown", d::RDatasetDescription)
+    s = description_to_markdown(d.content)
+    # Here, we return a Markdown string directly. This is useful for e.g. documentation,
+    # where we want to render the markdown as HTML.
+    show(io, mime, s)
+end
+# This returns raw HTML documentation.
+function Base.show(io::IO, mime::MIME"text/html", d::RDatasetDescription)
+    show(io, mime, Docs.HTML(d.content))
+end
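The MIME-specific `show` methods above follow a general Julia pattern: one type, several renderings selected by dispatch on the MIME type. The sketch below illustrates that pattern with the `Markdown` standard library only. `NoteSketch` is a hypothetical stand-in type that stores Markdown source, unlike `RDatasetDescription`, which stores HTML.

```julia
using Markdown

# Hypothetical stand-in type; it is not the PR's `RDatasetDescription`.
struct NoteSketch
    content::String   # Markdown source text
end

# Plain-text contexts (e.g. the REPL): render through a parsed Markdown object.
Base.show(io::IO, mime::MIME"text/plain", n::NoteSketch) =
    show(io, mime, Markdown.parse(n.content))

# Markdown contexts: write the source text unchanged.
Base.show(io::IO, ::MIME"text/markdown", n::NoteSketch) = print(io, n.content)

# HTML contexts: let the Markdown standard library do the conversion.
Base.show(io::IO, mime::MIME"text/html", n::NoteSketch) =
    show(io, mime, Markdown.parse(n.content))

note = NoteSketch("# Iris\n\nUse `dataset` to load it.")
md_text = repr(MIME("text/markdown"), note)   # the Markdown source as a String
html_text = repr(MIME("text/html"), note)     # the HTML rendering as a String
```

The markdown method in this sketch writes the text with `print`. As far as I can tell, Base does not define a `text/markdown` `show` method for a plain `String`, so forwarding a string to `show` with that MIME type may be worth checking.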
+
+
+"""
+    description_to_markdown(string::String)
+
+Converts an HTML string to markdown. This function is written specifically
+for HTML descriptions in RDatasets.jl, and so is a bit opinionated on what to
+replace, etc.
+
+It replaces all known HTML tags using regex, and then removes all other HTML tags.
+
+## Behaviour
+
+Currently, it handles the following HTML tags:
+- `<h1>`, `<h2>`, `<h3>`, `<h4>`, `<h5>`, `<h6>` -> `#`, `##`, `###`, `####`, `#####`, `######`
+- `<title>` -> `#`
+- `<code>` -> `` `code` ``
+- `<pre>` -> "```R\\npre\\n```"
+- `<EM>` -> `*EM*`
+- `<B>` -> `**B**`
+- `–` -> `-`
+
+## TODOs
+
+- Tables
+- Links
+- Images
+"""
+function description_to_markdown(string)
+    html_header_regex = r"<h(?'hnum'\d)>(?'content'[^<]+)<\/h\g'hnum'>"
+    function regexmatch2md(matched_string)
+        m = match(html_header_regex, matched_string)
+        if isnothing(m.captures[1]) || isnothing(m.captures[2])
+            return matched_string
+        end
+
+        hnum = parse(Int, m[:hnum])
+        content = m[:content]
+
+        return join(("\n", "#"^hnum, " ", content, "\n\n"))
     end
-    error("Unable to locate dataset file $rdaname or $csvname")
+    title_matcher_regex = r"<title>(?'content'[^<]+)<\/title>"
+    code_matcher_regex = r"<code>(?'content'[^<]+)<\/code>"
+    pre_matcher_regex = r"<pre>(?'content'[^<]+)<\/pre>"
+    emph_matcher_regex = r"<(?i)EM(?-i)>(?'content'[^<]+)<\/(?i)EM(?-i)>"
+    b_matcher_regex = r"<(?i)B(?-i)>(?'content'[^<]+)<\/(?i)B(?-i)>"
+    new_string = replace(
+        string,
+        html_header_regex => regexmatch2md,
+        title_matcher_regex => titlestr -> "# " * match(title_matcher_regex, titlestr)[:content],
+        code_matcher_regex => codestr -> "`" * match(code_matcher_regex, codestr)[:content] * "`",
+        pre_matcher_regex => prestr -> "\n```R\n" * match(pre_matcher_regex, prestr)[:content] * "\n```\n",
+        emph_matcher_regex => emphstr -> "*" * match(emph_matcher_regex, emphstr)[:content] * "*",
+        b_matcher_regex => bstr -> "**" * match(b_matcher_regex, bstr)[:content] * "**",
+        "–" => "-",
+    )
+    nohtml = replace(new_string, Regex("<[^>]*>") => "")
+    return replace(nohtml, Regex("\n\n+") => "\n\n")
 end
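The tag-by-tag replacement in `description_to_markdown` can be sketched on a reduced scale. The version below is an illustration under assumptions, not the function in this diff: `html_to_markdown_sketch` is a hypothetical name, it handles only headers and `<code>`, and it applies the patterns in sequential `replace` calls rather than one multi-pattern call.

```julia
# Hypothetical, reduced converter: handles <h1>..<h6> and <code> only.
function html_to_markdown_sketch(html::AbstractString)
    header = r"<h(\d)>([^<]+)</h\1>"   # \1 requires the closing tag to match the opening level
    code = r"<code>([^<]+)</code>"

    # A function on the right-hand side of a pair receives each matched substring,
    # so the pattern is matched again to get at the capture groups.
    header_to_md(s) = (m = match(header, s); "\n" * "#"^parse(Int, m[1]) * " " * m[2] * "\n\n")
    code_to_md(s) = "`" * match(code, s)[1] * "`"

    out = replace(html, header => header_to_md)
    out = replace(out, code => code_to_md)

    # Drop any remaining tags, then collapse runs of blank lines.
    out = replace(out, r"<[^>]*>" => "")
    return replace(out, r"\n\n+" => "\n\n")
end

md = html_to_markdown_sketch("<h2>Usage</h2><p>Call <code>iris</code> now</p>")
```

Sequential calls keep the sketch independent of the multi-pattern form of `replace`, which needs a recent Julia version. The trade-off is that a later pattern can match text produced by an earlier one.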
Probably a good occasion to tag 1.0.0. Clearly the package is stable enough.