Commit

Merge branch 'master' into Hossein/FastHierVisitor
hosseinmoein committed Nov 13, 2024
2 parents 101e81c + 6bf1500 commit 3a90b94
Showing 10 changed files with 1,836 additions and 38 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -56,7 +56,7 @@ I have followed a few <B>principles in this library</B>:<BR>

### Performance
You have probably heard of Polars DataFrame. It is implemented in Rust and ported with zero-overhead to Python (as long as you don’t have a loop). I have been asked by many people to write a comparison for <B>DataFrame vs. Polars</B>. So, I finally found some time to learn a bit about Polars and write a very simple benchmark.<BR>
- I wrote the following identical programs for both Polars and C++ DataFrame (and Pandas). I used Polars version: 0.19.14 (Pandas version: 1.5.3, Numpy version: 1.24.2). And I used C++20 clang compiler with -O3 option. I ran both on my, somewhat outdated, MacBook Pro.<BR>
+ I wrote the following identical programs for both Polars and C++ DataFrame (and Pandas). I used Polars version: 0.19.14 (Pandas version: 1.5.3, Numpy version: 1.24.2). And I used C++20 clang compiler with -O3 option. I ran both on my, somewhat outdated, MacBook Pro (Intel chip, 96GB RAM).<BR>
In both cases, I created a dataframe with 3 random columns. The C++ DataFrame also required an additional index column of the same size. Polars doesn’t believe in index columns (that has its own pros and cons. I am not going through it here).
Each program has three identical parts. First it generates and populates 3 columns with 300m random numbers each (in case of C++ DataFrame, it must also generate a sequential index column of the same size). That is the part I am _not_ interested in. In the second part, it calculates the mean of the first column, the variance of the second column, and the Pearson correlation of the second and third columns. In the third part, it does a select (or filter as Polars calls it) on one of the columns.
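
Note on the README paragraph above: the three statistics and the select step it names can be sketched with the standard library alone. This is only an illustration of what the benchmark measures, not the C++ DataFrame API and not the benchmark program itself. All function names are hypothetical, and the choice of sample variance (n - 1) is an assumption, since the README does not say which variant is used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration; not the C++ DataFrame API.

// Arithmetic mean of a non-empty column.
double column_mean(const std::vector<double> &col) {
    assert(!col.empty());
    double sum = 0.0;
    for (const double v : col) sum += v;
    return sum / static_cast<double>(col.size());
}

// Sample variance (n - 1 in the denominator); an assumption, since the
// README does not say which variance the benchmark computes.
double column_variance(const std::vector<double> &col) {
    assert(col.size() > 1);
    const double m = column_mean(col);
    double sq = 0.0;
    for (const double v : col) sq += (v - m) * (v - m);
    return sq / static_cast<double>(col.size() - 1);
}

// Pearson correlation of two columns of equal size.
double pearson_correlation(const std::vector<double> &x,
                           const std::vector<double> &y) {
    assert(x.size() == y.size() && x.size() > 1);
    const double mx = column_mean(x);
    const double my = column_mean(y);
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}

// "Select" (Polars: "filter"): copy out the values that satisfy pred.
template <typename Pred>
std::vector<double> select_where(const std::vector<double> &col, Pred pred) {
    std::vector<double> out;
    for (const double v : col)
        if (pred(v)) out.push_back(v);
    return out;
}
```

A call such as `select_where(col, [](double v) { return v > 0.0; })` corresponds to the third part of the benchmark. A real implementation over 300m rows would likely avoid the repeated passes over the data that this sketch makes.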

10 changes: 9 additions & 1 deletion docs/HTML/DataFrame.html
@@ -622,7 +622,7 @@ <H2 ID="2"><font color="blue">API Reference with code samples <font size="+4">&#
</tr>

<tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
- <td title="These are other functionalities of DataFrame" style="text-align:center;background-color:LightGrey;color:DarkBlue">Gears &nbsp;&nbsp; <font size="+3">&#x2699;</font></td>
+ <td title="These are other functionalities of DataFrame" style="text-align:center;background-color:LightGrey;color:DarkBlue">Gears &amp; Stuff &nbsp;&nbsp; <font size="+3">&#x2699;</font></td>
</tr>

<tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
@@ -946,6 +946,14 @@ <H2 ID="2"><font color="blue">API Reference with code samples <font size="+4">&#
<td title="Calculates the diff between shifted values">struct <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DiffVisitor.html">DiffVisitor</a>{}</td>
</tr>

+ <tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
+ <td title="Gives you the first dataitem in the given column">struct <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/FirstVisitor.html">FirstVisitor</a>{}</td>
+ </tr>
+
+ <tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
+ <td title="Gives you the last dataitem in the given column">struct <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/FirstVisitor.html">LastVisitor</a>{}</td>
+ </tr>
+
<tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
<td title="Calculates product">struct <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/ProdVisitor.html">ProdVisitor</a>{}</td>
</tr>
168 changes: 168 additions & 0 deletions docs/HTML/FirstVisitor.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/HTML/self_contained.html
@@ -51,7 +51,7 @@
It also has some disadvantages:
<UL>
<LI>There might be functionalities that are hard/time-consuming to implement that are already there</LI>
- <LI>If you find a battle-test library, the debugging is already done for you</LI>
+ <LI>If you find a battle-tested library, the debugging is already done for you</LI>
<LI>There might be industry-wide standards/trends that you want to follow by using a reputed library</LI>
</UL>
<BR>
62 changes: 31 additions & 31 deletions include/DataFrame/DataFrameStatsVisitors.h
@@ -6382,16 +6382,12 @@ struct LinearFitVisitor {
const H &x_begin, const H &x_end,
const H &y_begin, const H &y_end) {

- const size_type col_s = std::distance(x_begin, x_end);
+ const size_type col_s =
+ std::min(std::distance(x_begin, x_end),
+ std::distance(y_begin, y_end));
const auto thread_level = (col_s < ThreadPool::MUL_THR_THHOLD)
? 0L : ThreadGranularity::get_thread_level();

- #ifdef HMDF_SANITY_EXCEPTIONS
- if (col_s != size_type(std::distance(y_begin, y_end)))
- throw DataFrameError("LinearFitVisitor: two columns must be "
- "of equal sizes");
- #endif // HMDF_SANITY_EXCEPTIONS
-
value_type sum_x { 0 }; // Sum of all observed x
value_type sum_y { 0 }; // Sum of all observed y
value_type sum_x2 { 0 }; // Sum of all observed x squared
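
Note on the hunk above: the visitor now takes the shorter of the two ranges as its length instead of throwing, under `HMDF_SANITY_EXCEPTIONS`, when the sizes differ. A minimal sketch of that behavior, assuming an ordinary least-squares fit built from the running sums named in the context lines (`sum_x`, `sum_y`, `sum_x2`). The function `linear_fit`, the variable `sum_xy`, and the use of `std::vector<double>` are hypothetical stand-ins, not the library's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not LinearFitVisitor itself.
// Returns {slope, intercept} of the least-squares line y = slope * x + intercept,
// using only the first min(x.size(), y.size()) observations.
std::pair<double, double> linear_fit(const std::vector<double> &x,
                                     const std::vector<double> &y) {
    // Same idea as the new col_s: truncate to the shorter column.
    const std::size_t col_s = std::min(x.size(), y.size());
    assert(col_s > 1);

    double sum_x = 0.0;   // Sum of all observed x
    double sum_y = 0.0;   // Sum of all observed y
    double sum_x2 = 0.0;  // Sum of all observed x squared
    double sum_xy = 0.0;  // Sum of all x times y (name assumed)
    for (std::size_t i = 0; i < col_s; ++i) {
        sum_x += x[i];
        sum_y += y[i];
        sum_x2 += x[i] * x[i];
        sum_xy += x[i] * y[i];
    }

    const double n = static_cast<double>(col_s);
    const double denom = n * sum_x2 - sum_x * sum_x;
    assert(denom != 0.0);  // all x identical: the slope is undefined

    const double slope = (n * sum_xy - sum_x * sum_y) / denom;
    const double intercept = (sum_y - slope * sum_x) / n;
    return {slope, intercept};
}
```

The trade-off visible in the diff is that a size mismatch no longer raises an error; the extra trailing values of the longer column are silently ignored.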
@@ -7543,30 +7539,32 @@ is_normal(const V &column, double epsl, bool check_for_standard) {
svisit.post();

const value_type mean = static_cast<value_type>(svisit.get_mean());
- const value_type std = static_cast<value_type>(svisit.get_std());
- const value_type high_band_1 = static_cast<value_type>(mean + std);
- const value_type low_band_1 = static_cast<value_type>(mean - std);
+ const value_type stdev = static_cast<value_type>(svisit.get_std());
+ const value_type high_band_1 = static_cast<value_type>(mean + stdev);
+ const value_type low_band_1 = static_cast<value_type>(mean - stdev);
double count_1 = 0.0;
const value_type high_band_2 =
- static_cast<value_type>(mean + std * 2.0);
- const value_type low_band_2 = static_cast<value_type>(mean - std * 2.0);
+ static_cast<value_type>(mean + stdev * 2.0);
+ const value_type low_band_2 =
+ static_cast<value_type>(mean - stdev * 2.0);
double count_2 = 0.0;
const value_type high_band_3 =
- static_cast<value_type>(mean + std * 3.0);
- const value_type low_band_3 = static_cast<value_type>(mean - std * 3.0);
+ static_cast<value_type>(mean + stdev * 3.0);
+ const value_type low_band_3 =
+ static_cast<value_type>(mean - stdev * 3.0);
double count_3 = 0.0;

- for (auto citer : column) [[likely]] {
- if (citer >= low_band_1 && citer < high_band_1) {
+ for (const auto &val : column) [[likely]] {
+ if (val >= low_band_1 && val < high_band_1) {
count_3 += 1;
count_2 += 1;
count_1 += 1;
}
- else if (citer >= low_band_2 && citer < high_band_2) {
+ else if (val >= low_band_2 && val < high_band_2) {
count_3 += 1;
count_2 += 1;
}
- else if (citer >= low_band_3 && citer < high_band_3) {
+ else if (val >= low_band_3 && val < high_band_3) {
count_3 += 1;
}
}
@@ -7578,7 +7576,7 @@ is_normal(const V &column, double epsl, bool check_for_standard) {
std::fabs((count_3 / col_s) - 0.997) <= epsl) {
if (check_for_standard)
return (std::fabs(mean - 0) <= epsl &&
- std::fabs(std - 1.0) <= epsl);
+ std::fabs(stdev - 1.0) <= epsl);
return (true);
}
return (false);
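
Note on the two `is_normal` hunks above: the function counts how many values fall within one, two, and three standard deviations of the mean and compares those fractions with the empirical rule. A self-contained sketch of that check follows. Only the 0.997 threshold appears in the diff; the 0.68 and 0.95 thresholds are the textbook values and are assumed here, as is the use of the sample standard deviation. The name `looks_normal` is hypothetical, and the `check_for_standard` branch is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the is_normal() logic; not the library function.
bool looks_normal(const std::vector<double> &col, double epsl) {
    assert(col.size() > 1);
    const double col_s = static_cast<double>(col.size());

    double sum = 0.0;
    for (const double v : col) sum += v;
    const double mean = sum / col_s;

    double sq = 0.0;
    for (const double v : col) sq += (v - mean) * (v - mean);
    const double stdev = std::sqrt(sq / (col_s - 1.0));  // sample std (assumed)

    // Half-open band [mean - k * stdev, mean + k * stdev), as in the diff.
    const auto in_band = [mean, stdev](double v, double k) {
        return v >= mean - k * stdev && v < mean + k * stdev;
    };

    double count_1 = 0.0;  // within 1 standard deviation
    double count_2 = 0.0;  // within 2 standard deviations
    double count_3 = 0.0;  // within 3 standard deviations
    for (const double val : col) {
        if (in_band(val, 1.0)) {
            count_3 += 1;
            count_2 += 1;
            count_1 += 1;
        }
        else if (in_band(val, 2.0)) {
            count_3 += 1;
            count_2 += 1;
        }
        else if (in_band(val, 3.0)) {
            count_3 += 1;
        }
    }

    // 0.68 and 0.95 are assumed; only 0.997 is visible in the diff.
    return std::fabs((count_1 / col_s) - 0.68) <= epsl &&
           std::fabs((count_2 / col_s) - 0.95) <= epsl &&
           std::fabs((count_3 / col_s) - 0.997) <= epsl;
}
```

The rename from `std` to `stdev` in the commit does not change behavior: a local variable named `std` does not hide the namespace in a qualified name such as `std::fabs`, but the new name is easier to read.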
@@ -7597,28 +7595,30 @@ is_lognormal(const V &column, double epsl) {
StatsVisitor<value_type, int> log_visit;

svisit.pre();
- for (auto citer : column) [[likely]] {
- svisit(dummy_idx, static_cast<value_type>(std::log(citer)));
- log_visit(dummy_idx, citer);
+ for (auto val : column) [[likely]] {
+ svisit(dummy_idx, static_cast<value_type>(std::log(val)));
+ log_visit(dummy_idx, val);
}
svisit.post();

const value_type mean = static_cast<value_type>(svisit.get_mean());
- const value_type std = static_cast<value_type>(svisit.get_std());
- const value_type high_band_1 = static_cast<value_type>(mean + std);
- const value_type low_band_1 = static_cast<value_type>(mean - std);
+ const value_type stdev = static_cast<value_type>(svisit.get_std());
+ const value_type high_band_1 = static_cast<value_type>(mean + stdev);
+ const value_type low_band_1 = static_cast<value_type>(mean - stdev);
double count_1 = 0.0;
const value_type high_band_2 =
- static_cast<value_type>(mean + std * 2.0);
- const value_type low_band_2 = static_cast<value_type>(mean - std * 2.0);
+ static_cast<value_type>(mean + stdev * 2.0);
+ const value_type low_band_2 =
+ static_cast<value_type>(mean - stdev * 2.0);
double count_2 = 0.0;
const value_type high_band_3 =
- static_cast<value_type>(mean + std * 3.0);
- const value_type low_band_3 = static_cast<value_type>(mean - std * 3.0);
+ static_cast<value_type>(mean + stdev * 3.0);
+ const value_type low_band_3 =
+ static_cast<value_type>(mean - stdev * 3.0);
double count_3 = 0.0;

- for (auto citer : column) [[likely]] {
- const auto log_val = std::log(citer);
+ for (const auto &val : column) [[likely]] {
+ const auto log_val = std::log(val);

if (log_val >= low_band_1 && log_val < high_band_1) {
count_3 += 1;