This repository has been archived by the owner on Feb 2, 2024. It is now read-only.
I am trying to use HPAT to accelerate data science workloads, especially the ETL process.
The data frame I am using contains 21,721,922 rows and 45 columns. All the data entries use float64 dtype. There is no missing data after cleaning.
I put the following code into an HPAT-decorated function. It simply groups the data frame by "year" and calculates the average for each year. I am tracking the execution time of the groupby-agg operation.
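The operation in question is a plain groupby-mean. A minimal sketch (the column names here are illustrative, not my real schema; under HPAT the function would additionally carry the `@hpat.jit` decorator):

```python
import pandas as pd

# Under HPAT this function would be decorated with @hpat.jit so that
# the groupby is compiled and distributed across MPI processes.
def groupby_mean(df):
    # Group by the "year" column and average every other column.
    return df.groupby("year").mean()

# Tiny synthetic frame standing in for the 21.7M-row dataset.
df = pd.DataFrame({
    "year": [1970, 1970, 1980, 1980, 1990],
    "value": [1.0, 3.0, 2.0, 4.0, 5.0],
})
result = groupby_mean(df)
```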
I am using a server with two Intel(R) Xeon(R) E5-2699 v4 CPUs, 44 cores in total.
The results look like this:
The baseline uses pandas only, without HPAT.

| Num. of cores | groupby-agg time (sec.) |
| --- | --- |
| baseline | 0.227021694 |
| 1 | 1.437 |
| 2 | 1.39 |
| 3 | 1.398 |
| 4 | 1.427 |
| 11 | 1.51 |
| 22 | 1.794 |
| 44 | 2.838 |
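For context, HPAT parallelizes via MPI, so the core counts above correspond (assuming the standard launch method) to invocations like the following; the script name is hypothetical:

```shell
# Vary the MPI process count to scale HPAT across cores.
mpiexec -n 1 python groupby_bench.py
mpiexec -n 44 python groupby_bench.py
```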
We observe that as the number of processes increases, the time spent on groupby-agg also increases. Since groupby-agg is a simple map-reduce parallel pattern that should parallelize well, this result is puzzling to me. Second, even with a single process, HPAT is slower than plain pandas.
Below are the groupby-count results for my dataset. Note that each year contains plenty of data entries, so there should be sufficient parallelism.
| YEAR | count |
| --- | --- |
| 1970 | 1486744 |
| 1980 | 8746006 |
| 1990 | 1906165 |
| 2000 | 2199860 |
| 2010 | 2494822 |
Am I missing something? Could you give some suggestions on how to accelerate the groupby-agg operation using HPAT?
Thank you so much.
Best regards,
Hongyuan Liu