diff --git a/previews/PR107/.documenter-siteinfo.json b/previews/PR107/.documenter-siteinfo.json index 182d2be..6b23e73 100644 --- a/previews/PR107/.documenter-siteinfo.json +++ b/previews/PR107/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.11.0","generation_timestamp":"2024-10-14T13:29:29","documenter_version":"1.7.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.11.0","generation_timestamp":"2024-10-14T13:31:42","documenter_version":"1.7.0"}} \ No newline at end of file diff --git a/previews/PR107/api/index.html b/previews/PR107/api/index.html index 179388b..f9753ef 100644 --- a/previews/PR107/api/index.html +++ b/previews/PR107/api/index.html @@ -1,7 +1,7 @@ API · StreamSampling.jl

API

This is the API page of the package. For a general overview of the functionalities consult the ReadMe.

General Functionalities

StreamSampling.ReservoirSampleType
ReservoirSample{T}([rng], method = AlgRSWRSKIP())
-ReservoirSample{T}([rng], n::Int, method = AlgL(); ordered = false)

Initializes a reservoir sample which can then be fitted with fit!. The first signature represents a sample where only a single element is collected. If ordered is true, the reservoir sample values can be retrieved in the order they were collected with ordvalue.

Look at the Sampling Algorithms section for the supported methods.

source
StatsAPI.fit!Function
fit!(rs::AbstractReservoirSample, el)
-fit!(rs::AbstractReservoirSample, el, w)

Updates the reservoir sample by taking into account the element passed. If the sampling is weighted also the weight of the elements needs to be passed.

source
Base.merge!Function
Base.merge!(rs::AbstractReservoirSample, rs::AbstractReservoirSample...)

Updates the first reservoir sample by merging its value with the values of the other samples. Currently only supported for samples with replacement.

source
Base.mergeFunction
Base.merge(rs::AbstractReservoirSample...)

Creates a new reservoir sample by merging the values of the samples passed. Currently only supported for sample with replacement.

source
Base.empty!Function
Base.empty!(rs::AbstractReservoirSample)

Resets the reservoir sample to its initial state. Useful to avoid allocating a new sample in some cases.

source
OnlineStatsBase.valueFunction
value(rs::AbstractReservoirSample)

Returns the elements collected in the sample at the current sampling stage.

Note that even if the sampling respects the schema it is assigned when ReservoirSample is instantiated, some ordering in the sample can be more probable than others. To represent each one with the same probability call shuffle! over the result.

source
StreamSampling.ordvalueFunction
ordvalue(rs::AbstractReservoirSample)

Returns the elements collected in the sample at the current sampling stage in the order they were collected. This applies only when ordered = true is passed in ReservoirSample.

source
StatsAPI.nobsFunction
nobs(rs::AbstractReservoirSample)

Returns the total number of elements that have been observed so far during the sampling process.

source
StreamSampling.StreamSampleType
StreamSample{T}([rng], iter, n, [N], method = AlgD())

Initializes a stream sample, which can then be iterated over to return the sampling elements of the iterable iter which is assumed to have a eltype of T. The methods implemented in StreamSample require the knowledge of the total number of elements in the stream N, if not provided it is assumed to be available by calling length(iter).

source
StreamSampling.itsampleFunction
itsample([rng], iter, method = AlgRSWRSKIP())
+ReservoirSample{T}([rng], n::Int, method = AlgL(); ordered = false)

Initializes a reservoir sample which can then be fitted with fit!. The first signature represents a sample where only a single element is collected. If ordered is true, the reservoir sample values can be retrieved in the order they were collected with ordvalue.

Look at the Sampling Algorithms section for the supported methods.

source
StatsAPI.fit!Function
fit!(rs::AbstractReservoirSample, el)
+fit!(rs::AbstractReservoirSample, el, w)

Updates the reservoir sample by taking into account the element passed. If the sampling is weighted also the weight of the elements needs to be passed.

source
Base.merge!Function
Base.merge!(rs::AbstractReservoirSample, rs::AbstractReservoirSample...)

Updates the first reservoir sample by merging its value with the values of the other samples. Currently only supported for samples with replacement.

source
Base.mergeFunction
Base.merge(rs::AbstractReservoirSample...)

Creates a new reservoir sample by merging the values of the samples passed. Currently only supported for sample with replacement.

source
Base.empty!Function
Base.empty!(rs::AbstractReservoirSample)

Resets the reservoir sample to its initial state. Useful to avoid allocating a new sample in some cases.

source
OnlineStatsBase.valueFunction
value(rs::AbstractReservoirSample)

Returns the elements collected in the sample at the current sampling stage.

Note that even if the sampling respects the schema it is assigned when ReservoirSample is instantiated, some ordering in the sample can be more probable than others. To represent each one with the same probability call shuffle! over the result.

source
StreamSampling.ordvalueFunction
ordvalue(rs::AbstractReservoirSample)

Returns the elements collected in the sample at the current sampling stage in the order they were collected. This applies only when ordered = true is passed in ReservoirSample.

source
StatsAPI.nobsFunction
nobs(rs::AbstractReservoirSample)

Returns the total number of elements that have been observed so far during the sampling process.

source
StreamSampling.StreamSampleType
StreamSample{T}([rng], iter, n, [N], method = AlgD())

Initializes a stream sample, which can then be iterated over to return the sampling elements of the iterable iter which is assumed to have a eltype of T. The methods implemented in StreamSample require the knowledge of the total number of elements in the stream N, if not provided it is assumed to be available by calling length(iter).

source
StreamSampling.itsampleFunction
itsample([rng], iter, method = AlgRSWRSKIP())
 itsample([rng], iter, wfunc, method = AlgWRSWRSKIP())

Return a random element of the iterator, optionally specifying a rng (which defaults to Random.default_rng()) and a function wfunc which accepts each element as input and outputs the corresponding weight. If the iterator is empty, it returns nothing.


itsample([rng], iter, n::Int, method = AlgL(); ordered = false)
 itsample([rng], iter, wfunc, n::Int, method = AlgAExpJ(); ordered = false)

Return a vector of n random elements of the iterator, optionally specifying a rng (which defaults to Random.default_rng()), a weight function wfunc and a method. ordered dictates whether an ordered sample (also called a sequential sample, i.e. a sample where items appear in the same order as in iter) must be collected.

If the iterator has fewer than n elements, in the case of sampling without replacement, it returns a vector of those elements.


itsample(rngs, iters, n::Int)
-itsample(rngs, iters, wfuncs, n::Int)

Parallel implementation which returns a sample with replacement of size n from the multiple iterables. All the arguments except n must be tuples.

source

Sampling Algorithms

StreamSampling.AlgRSWRSKIPType

Implements random reservoir sampling with replacement. To be used with ReservoirSample or itsample.

Adapted from algorithm RSWR-SKIP described in "Reservoir-based Random Sampling with Replacement from Data Stream, B. Park et al., 2008".

source
StreamSampling.AlgAResType

Implements weighted random reservoir sampling without replacement. To be used with ReservoirSample or itsample.

Adapted from algorithm A-Res described in "Weighted random sampling with a reservoir, P. S. Efraimidis et al., 2006".

source
StreamSampling.AlgAExpJType

Implements weighted random reservoir sampling without replacement. To be used with ReservoirSample or itsample.

Adapted from algorithm A-ExpJ described in "Weighted random sampling with a reservoir, P. S. Efraimidis et al., 2006".

source
StreamSampling.AlgWRSWRSKIPType

Implements weighted random reservoir sampling with replacement. To be used with ReservoirSample or itsample.

Adapted from algorithm WRSWR-SKIP described in "Weighted Reservoir Sampling with Replacement from Multiple Data Streams, A. Meligrana, 2024".

source
StreamSampling.AlgDType

Implements random sampling without replacement. To be used with StreamSample or itsample.

Adapted from algorithm D described in "An Efficient Algorithm for Sequential Random Sampling, J. S. Vitter, 1987".

source
+itsample(rngs, iters, wfuncs, n::Int)

Parallel implementation which returns a sample with replacement of size n from the multiple iterables. All the arguments except n must be tuples.

source

Sampling Algorithms

StreamSampling.AlgRType

Implements random reservoir sampling without replacement. To be used with ReservoirSample or itsample.

Adapted from algorithm R described in "Random sampling with a reservoir, J. S. Vitter, 1985".

source
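
For readers unfamiliar with the textbook form of Algorithm R, a minimal sketch in plain Julia follows. It is not the package's implementation: the helper name `reservoir_r` and its return value are assumptions made for illustration, and only the standard library `Random` module is used.

```julia
using Random

# Hypothetical helper for illustration; not part of StreamSampling.jl.
# Keeps a uniform sample without replacement of up to `n` items from `iter`.
function reservoir_r(rng::AbstractRNG, iter, n::Int)
    reservoir = Vector{eltype(iter)}()
    seen = 0                          # number of items observed so far
    for x in iter
        seen += 1
        if seen <= n
            push!(reservoir, x)       # fill the reservoir with the first n items
        else
            j = rand(rng, 1:seen)     # uniform index over everything seen so far
            if j <= n
                reservoir[j] = x      # replace a slot with probability n / seen
            end
        end
    end
    return reservoir, seen
end

picked, seen = reservoir_r(MersenneTwister(42), 1:100, 5)
```

When the stream holds fewer than `n` items, the sketch simply returns all of them, which matches the behaviour described for sampling without replacement in `itsample`.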
StreamSampling.AlgLType

Implements random reservoir sampling without replacement. To be used with ReservoirSample or itsample.

Adapted from algorithm L described in "Random sampling with a reservoir, J. S. Vitter, 1985".

source
StreamSampling.AlgRSWRSKIPType

Implements random reservoir sampling with replacement. To be used with ReservoirSample or itsample.

Adapted from algorithm RSWR-SKIP described in "Reservoir-based Random Sampling with Replacement from Data Stream, B. Park et al., 2008".

source
StreamSampling.AlgAResType

Implements weighted random reservoir sampling without replacement. To be used with ReservoirSample or itsample.

Adapted from algorithm A-Res described in "Weighted random sampling with a reservoir, P. S. Efraimidis et al., 2006".

source
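
The idea behind A-Res can be sketched in a few lines of plain Julia: each item with weight w draws a key u^(1/w), with u uniform in [0, 1), and the n items with the largest keys are kept. The helper name `ares_sample` is hypothetical, and the linear scan for the smallest key is a simplification chosen for brevity (a heap would be the usual choice); this is not the package's code.

```julia
using Random

# Hypothetical helper for illustration; not part of StreamSampling.jl.
# Weighted sampling without replacement in the style of A-Res.
function ares_sample(rng::AbstractRNG, iter, wfunc, n::Int)
    items = Vector{eltype(iter)}()
    ks = Float64[]                    # key associated with each kept item
    for x in iter
        w = wfunc(x)
        w > 0 || continue             # items with non-positive weight are never selected
        k = rand(rng)^(1 / w)         # larger weights tend to give larger keys
        if length(items) < n
            push!(items, x)
            push!(ks, k)
        else
            i = argmin(ks)            # slot holding the smallest key
            if k > ks[i]
                items[i] = x
                ks[i] = k
            end
        end
    end
    return items
end

# Weight each integer by its own value, so larger numbers are favoured.
picked = ares_sample(MersenneTwister(42), 1:100, x -> float(x), 5)
```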
StreamSampling.AlgAExpJType

Implements weighted random reservoir sampling without replacement. To be used with ReservoirSample or itsample.

Adapted from algorithm A-ExpJ described in "Weighted random sampling with a reservoir, P. S. Efraimidis et al., 2006".

source
StreamSampling.AlgWRSWRSKIPType

Implements weighted random reservoir sampling with replacement. To be used with ReservoirSample or itsample.

Adapted from algorithm WRSWR-SKIP described in "Weighted Reservoir Sampling with Replacement from Multiple Data Streams, A. Meligrana, 2024".

source
StreamSampling.AlgDType

Implements random sampling without replacement. To be used with StreamSample or itsample.

Adapted from algorithm D described in "An Efficient Algorithm for Sequential Random Sampling, J. S. Vitter, 1987".

source
StreamSampling.AlgORDSWRType

Implements random stream sampling with replacement. To be used with StreamSample or itsample.

Adapted from algorithm 4 described in "Generating Sorted Lists of Random Numbers, J. L. Bentley et al., 1980".

source
diff --git a/previews/PR107/benchmark/index.html b/previews/PR107/benchmark/index.html index 1be25e0..86cfd63 100644 --- a/previews/PR107/benchmark/index.html +++ b/previews/PR107/benchmark/index.html @@ -1,2 +1,2 @@ -Benchmark Comparison · StreamSampling.jl

Benchmark Comparison

Using these sampling techniques can considerably reduce the memory usage of the program, but there are cases where they are also more time efficient, as demonstrated below by a comparison with the equivalent methods of StatsBase.sample:

image

The “collection-based with setup” methods consider collecting the iterator in memory as part of the benchmark. The code to reproduce this benchmark is in benchmarkcomparisonstream.jl.

+Benchmark Comparison · StreamSampling.jl

Benchmark Comparison

Using these sampling techniques can considerably reduce the memory usage of the program, but there are cases where they are also more time efficient, as demonstrated below by a comparison with the equivalent methods of StatsBase.sample:

image

The “collection-based with setup” methods consider collecting the iterator in memory as part of the benchmark. The code to reproduce this benchmark is in benchmarkcomparisonstream.jl.

diff --git a/previews/PR107/example/index.html b/previews/PR107/example/index.html index e6eb6ef..469242b 100644 --- a/previews/PR107/example/index.html +++ b/previews/PR107/example/index.html @@ -16,4 +16,4 @@ end

We use some toy data for illustration

julia> stream = 1:10^8; # the data stream
 
 julia> thr = 2*10^7; # the threshold for the mean monitoring

Then, we run the monitoring

julia> rs = monitor(stream, thr);

The number of observations until the detection is triggered is given by

julia> nobs(rs)
-40009000

which is very close to the true value of 4*10^7 - 1 observations.

Note that in this case we could use an online mean method instead of holding the whole sample in memory. However, the sample-based approach is more general because it allows estimating any statistic of the stream.

+40009000

which is very close to the true value of 4*10^7 - 1 observations.
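
The true value quoted above can be checked in closed form. The mean of the stream 1:m is (m + 1) / 2, so it first reaches a threshold thr at m = 2*thr - 1. The sketch below assumes the detection fires as soon as the mean is greater than or equal to the threshold; the exact comparison used by monitor is not shown in this excerpt, and the helper names are hypothetical.

```julia
# Exact mean of the integers 1, 2, ..., m.
exact_mean(m::Integer) = (m + 1) / 2

# Smallest m whose exact mean is at least `thr` (assumes a >= comparison).
first_crossing(thr::Integer) = 2 * thr - 1

thr = 2 * 10^7
m = first_crossing(thr)

# Relative gap between the sampled detection point and the exact one.
relative_gap = (40009000 - m) / m
```

With the figures above, the relative gap is roughly 0.02%.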

Note that in this case we could use an online mean method instead of holding the whole sample in memory. However, the sample-based approach is more general because it allows estimating any statistic of the stream.

diff --git a/previews/PR107/index.html b/previews/PR107/index.html index 52ef11c..f516671 100644 --- a/previews/PR107/index.html +++ b/previews/PR107/index.html @@ -1,5 +1,5 @@ -StreamSampling.jl · StreamSampling.jl

StreamSampling.jl

StreamSamplingModule

StreamSampling.jl

CI codecov Aqua QA DOI

The scope of this package is to provide general methods to sample from any stream in a single pass through the data, even when the number of items contained in the stream is unknown.

This has some advantages over other sampling procedures:

  • If the iterable is lazy, the memory required is a small constant or grows in relation to the size of the sample, instead of the whole population.
  • With reservoir methods, the sample collected is a random sample of the portion of the stream seen thus far at any point of the sampling process.
  • In some cases, sampling with the techniques implemented in this library can bring considerable performance gains, since the population of items doesn't need to be previously stored in memory.

For information about the available functionalities consult the documentation.

Contributing

Contributions are welcome! If you encounter any issues, have suggestions for improvements, or would like to add new features, feel free to open an issue or submit a pull request.

source

Overview of the functionalities

The itsample function allows to consume all the stream at once and return the sample collected:

julia> using StreamSampling
+StreamSampling.jl · StreamSampling.jl

StreamSampling.jl

StreamSamplingModule

StreamSampling.jl

CI codecov Aqua QA DOI

The scope of this package is to provide general methods to sample from any stream in a single pass through the data, even when the number of items contained in the stream is unknown.

This has some advantages over other sampling procedures:

  • If the iterable is lazy, the memory required is a small constant or grows in relation to the size of the sample, instead of the whole population.
  • With reservoir methods, the sample collected is a random sample of the portion of the stream seen thus far at any point of the sampling process.
  • In some cases, sampling with the techniques implemented in this library can bring considerable performance gains, since the population of items doesn't need to be previously stored in memory.

For information about the available functionalities consult the documentation.

Contributing

Contributions are welcome! If you encounter any issues, have suggestions for improvements, or would like to add new features, feel free to open an issue or submit a pull request.

source

Overview of the functionalities

The itsample function allows to consume all the stream at once and return the sample collected:

julia> using StreamSampling
 
 julia> st = 1:100;
 
@@ -43,4 +43,4 @@
  22
  26
  35
- 75

The advantage of StreamSample iterators with respect to ReservoirSample is that they require O(1) memory if not collected, while reservoir techniques require O(k) memory where k is the number of elements in the sample.

Consult the API page for more information about the package interface.

+ 75

The advantage of StreamSample iterators with respect to ReservoirSample is that they require O(1) memory if not collected, while reservoir techniques require O(k) memory where k is the number of elements in the sample.
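
To make the O(1) claim concrete, here is a minimal sketch of sequential sampling when the stream length N is known. It uses the simpler selection-sampling scheme (often called Algorithm S), not Algorithm D, and the helper name `sequential_sample` is hypothetical. Each selected item is handed to a callback as it is found, so nothing is stored unless the caller chooses to store it.

```julia
using Random

# Hypothetical helper for illustration; not part of StreamSampling.jl.
# Calls `f(x)` on exactly `n` items of `iter`, in stream order,
# assuming `iter` yields exactly `N >= n` items.
function sequential_sample(f, rng::AbstractRNG, iter, n::Int, N::Int)
    remaining = N                     # items not yet examined
    needed = n                        # items still to be selected
    for x in iter
        needed == 0 && break
        # Select x with probability needed / remaining.
        if rand(rng) * remaining < needed
            f(x)
            needed -= 1
        end
        remaining -= 1
    end
    return nothing
end

# Only two counters are kept internally; here the caller decides to collect.
out = Int[]
sequential_sample(MersenneTwister(42), 1:100, 5, 100) do x
    push!(out, x)
end
```

The selected items come out in the same order as in the stream, which is the ordered (sequential) sample described in the API page.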

Consult the API page for more information about the package interface.