This repository has been archived by the owner on Feb 2, 2024. It is now read-only.
I am trying to use HPAT to accelerate data science workloads, especially the ETL process.
The data frame I am using contains 21,721,922 rows and 45 columns. All the data entries use float64 dtype. There is no missing data after cleaning.
I put the following code into an HPAT-decorated function. It simply groups the data frame by "year" and calculates the average for each year. I am tracking the execution time of the groupby-agg operation.
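The operation in question is a plain groupby-mean. A minimal sketch (the column names here are illustrative, not my real schema; under HPAT the function would additionally carry the `@hpat.jit` decorator):

```python
import pandas as pd

# Under HPAT this function would be decorated with @hpat.jit so that
# the groupby is compiled and distributed across MPI processes.
def groupby_mean(df):
    # Group by the "year" column and average every other column.
    return df.groupby("year").mean()

# Tiny synthetic frame standing in for the 21.7M-row dataset.
df = pd.DataFrame({
    "year": [1970, 1970, 1980, 1980, 1990],
    "value": [1.0, 3.0, 2.0, 4.0, 5.0],
})
result = groupby_mean(df)
```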
I am using a server with two Intel(R) Xeon(R) E5-2699 v4 CPUs, 44 cores in total.
The results look like this:
The baseline uses pandas only, without HPAT.

| Num. of cores | groupby-agg time (sec.) |
| --- | --- |
| baseline | 0.227021694 |
| 1 | 1.437 |
| 2 | 1.39 |
| 3 | 1.398 |
| 4 | 1.427 |
| 11 | 1.51 |
| 22 | 1.794 |
| 44 | 2.838 |
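For context, HPAT parallelizes via MPI, so the core counts above correspond (assuming the standard launch method) to invocations like the following; the script name is hypothetical:

```shell
# Vary the MPI process count to scale HPAT across cores.
mpiexec -n 1 python groupby_bench.py
mpiexec -n 44 python groupby_bench.py
```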
We observe that as the number of processes increases, the time spent on groupby-agg also increases. Since groupby-agg is a simple map-reduce parallel pattern that should parallelize well, this result is puzzling to me. Second, even with a single process, HPAT is slower than plain pandas.
Below are the groupby-count results for my dataset. Note that each year contains plenty of data entries, so there should be sufficient parallelism.
| YEAR | count |
| --- | --- |
| 1970 | 1486744 |
| 1980 | 8746006 |
| 1990 | 1906165 |
| 2000 | 2199860 |
| 2010 | 2494822 |
Am I missing something? Could you give some suggestions on how to accelerate the groupby-agg operation using HPAT?
Thank you so much.
Best regards,
Hongyuan Liu