########################################################################################
TPC-DS benchmark scripts for HAWQ and Greenplum database.
########################################################################################
Supported versions:
Greenplum 4.2, 4.3, 5.0
HAWQ 1.3, 2.0, 2.1, 2.2
########################################################################################
TPC-DS Information
########################################################################################
Version 2.x now uses TPC-DS version 2.1 for the standard queries.
Version 2.2.x now supports Greenplum version 5.x.
########################################################################################
Query Options
########################################################################################
You can use one of three types of queries in these tests:
1. tpcds: The standard TPC-DS queries, generated by dsqgen based on the scale of the
benchmark you are running.
2. e9: The E9 version of the TPC-DS queries, which are static and based on TPC-DS
version 1.4. It also includes a modified query 67 that was changed to improve
performance.
3. imp: The set of queries created by Cloudera for testing Impala. These queries are
provided for an apples-to-apples comparison of run times with Impala. They were copied
from this public repo:
https://github.com/cloudera/impala-tpcds-kit
The license file for these queries is also included:
impala_queries_license.txt
The Impala queries were changed for syntax and to remove partition cheating. Impala
doesn't support concatenation with ||, so Cloudera changed the SQL to use its concat()
function; this was changed back. Intervals were changed from "interval 30 days" to
"'30 days'::interval". Query hints were removed. There are 117 removals of explicit
partition pruning across 63 unique queries. Queries 3, 4, 7, 9, 14, 24, 35, 46, and 51
were heavily modified by Cloudera, so these revert to the original TPC-DS versions.
These options are set in tpcds_variables.sh like so:
SQL_VERSION="tpcds"
You can also have the queries execute with "EXPLAIN ANALYZE" to see the exact query
plan used, the cost, the memory used, etc. This is set in tpcds_variables.sh
like this:
EXPLAIN_ANALYZE="true"
Note: The EXPLAIN ANALYZE option is only available when using the standard TPC-DS
queries.
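Taken together, the relevant lines in tpcds_variables.sh look like this (a sketch;
valid SQL_VERSION values are tpcds, e9, and imp):
SQL_VERSION="tpcds"
EXPLAIN_ANALYZE="true"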
########################################################################################
Storage Options
########################################################################################
Table storage is defined in functions.sh and is configured for optimal performance. The
E9 query option will also override the standard storage options and use slightly
different options for backward compatibility.
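For reference, this is roughly the shape of an append-optimized storage clause in
Greenplum; the exact clauses the scripts use live in functions.sh, and the table name
and column here are hypothetical:
psql -c "create table storage_demo (id int)
         with (appendonly=true, orientation=column, compresstype=quicklz)
         distributed by (id);"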
########################################################################################
Prerequisites
########################################################################################
1. Greenplum Database or Apache HAWQ installed and running
2. Connectivity to the MASTER_HOST and between the Master and the Data Nodes / Segment Hosts
3. Root access
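Connectivity can be verified up front; the host names mdw, sdw1, and sdw2 below are
examples, substitute your own:
ssh root@mdw hostname
gpssh -h sdw1 -h sdw2 -e 'hostname'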
########################################################################################
Installation
########################################################################################
1. ssh to the Master host as root
ssh root@mdw
2. Download the tpcds.sh file
curl https://raw.githubusercontent.com/pivotalguru/TPC-DS/master/tpcds.sh > tpcds.sh
chmod 755 tpcds.sh
########################################################################################
Variables and Configuration
########################################################################################
By default, the installation will create the scripts in /pivotalguru/TPC-DS on the
Master host. This can be changed by editing the dynamically configured
tpcds_variables.sh file that is created the first time tpcds.sh is run.
Also by default, TPC-DS files are generated on each Segment Host / Data Node in the
Segment's PGDATA/pivotalguru directory. If there isn't enough space in this directory
on each Segment, you can create a symbolic link to a drive location that does have
enough space, as in the sketch below.
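A sketch of the symbolic-link workaround for a single Greenplum segment directory; the
paths here are hypothetical and the link must be repeated for every segment directory
on every Segment Host:
mkdir /bigdisk/pivotalguru
chown gpadmin:gpadmin /bigdisk/pivotalguru
ln -s /bigdisk/pivotalguru /data1/primary/gpseg0/pivotalguru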
########################################################################################
HAWQ 2.x
########################################################################################
For HAWQ 2.x, this directory is named PGDATA/pivotalguru_$i where $i ranges from 1 to
the GUC hawq_rm_nvseg_perquery_perseg_limit. See the notes below for more information.
Example of creating links with PGDATA = /data/hawq/segment, using gpssh as root:
for i in $(seq 1 8); do mkdir /data$i/pivotalguru; done
chown gpadmin:gpadmin /data*/pivotalguru
for i in $(seq 1 8); do ln -s /data$i/pivotalguru /data/hawq/segment/pivotalguru_$i; done
The above is only for HAWQ 2.x. For GPDB and HAWQ 1.3, the segment directory structure
is different.
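To see how many pivotalguru_$i links you need, you can check the GUC first (a sketch,
assuming the hawq utility is on your PATH):
hawq config -s hawq_rm_nvseg_perquery_perseg_limit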
########################################################################################
Ambari installation
########################################################################################
If Ambari is used to manage the cluster, you will need to add the following properties
to "Custom hawq-site.xml":
optimizer_analyze_root_partition [on]
optimizer [on]
Change:
VM Overcommit Ratio [100]
Segment Memory Usage Limit [200] (based on the availability of RAM)
hawq_rm_stmt_vseg_memory [16gb] (based on the availability of RAM)
gp_autostats_mode [none]
Refer to Pivotal HDB documentation on how to set the Segment Memory Usage, VM Overcommit
Ratio, and Statement Memory settings.
########################################################################################
Execution
########################################################################################
1. Execute tpcds.sh
./tpcds.sh
########################################################################################
Notes
########################################################################################
- A tpcds_variables.sh file will be created with variables you can adjust
- Files for the benchmark will be created in a sub-directory named pivotalguru located
in each segment directory on each segment host / data node.
You can update these directories to be symbolic links to better utilize the disk
volumes you have available.
- Example of running tpcds as root as a background process:
nohup ./tpcds.sh > tpcds.log 2>&1 < /dev/null &
########################################################################################
TPC-DS Minor Modifications
########################################################################################
1. SQL queries that add or subtract days were modified slightly:
Old:
and (cast('2000-02-28' as date) + 30 days)
New:
and (cast('2000-02-28' as date) + '30 days'::interval)
This was done on queries: 5, 12, 16, 20, 21, 32, 37, 40, 77, 80, 82, 92, 94, 95, and 98.
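You can confirm the rewritten expression is valid in Greenplum and HAWQ with a quick
psql check:
psql -c "select cast('2000-02-28' as date) + '30 days'::interval;"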
2. Change to queries with ORDER BY on column alias to use sub-select.
Old:
select
sum(ss_net_profit) as total_sum
,s_state
,s_county
,grouping(s_state)+grouping(s_county) as lochierarchy
,rank() over (
partition by grouping(s_state)+grouping(s_county),
case when grouping(s_county) = 0 then s_state end
order by sum(ss_net_profit) desc) as rank_within_parent
from
store_sales
,date_dim d1
,store
where
d1.d_month_seq between 1212 and 1212+11
and d1.d_date_sk = ss_sold_date_sk
and s_store_sk = ss_store_sk
and s_state in
( select s_state
from (select s_state as s_state,
rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
from store_sales, store, date_dim
where d_month_seq between 1212 and 1212+11
and d_date_sk = ss_sold_date_sk
and s_store_sk = ss_store_sk
group by s_state
) tmp1
where ranking <= 5
)
group by rollup(s_state,s_county)
order by
lochierarchy desc
,case when lochierarchy = 0 then s_state end
,rank_within_parent
limit 100;
New:
select * from ( --new
select
sum(ss_net_profit) as total_sum
,s_state
,s_county
,grouping(s_state)+grouping(s_county) as lochierarchy
,rank() over (
partition by grouping(s_state)+grouping(s_county),
case when grouping(s_county) = 0 then s_state end
order by sum(ss_net_profit) desc) as rank_within_parent
from
store_sales
,date_dim d1
,store
where
d1.d_month_seq between 1212 and 1212+11
and d1.d_date_sk = ss_sold_date_sk
and s_store_sk = ss_store_sk
and s_state in
( select s_state
from (select s_state as s_state,
rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
from store_sales, store, date_dim
where d_month_seq between 1212 and 1212+11
and d_date_sk = ss_sold_date_sk
and s_store_sk = ss_store_sk
group by s_state
) tmp1
where ranking <= 5
)
group by rollup(s_state,s_county)
) AS sub --new
order by
lochierarchy desc
,case when lochierarchy = 0 then s_state end
,rank_within_parent
limit 100;
This was done on queries: 36 and 70.
3. Query templates were modified to exclude columns not found in the query. In these cases, the common
table expression used aliased columns but the dynamic filters included both the alias name as well as the
original name. Referencing the original column name instead of the alias causes the query parser to fail
to find the column.
This was done on query 86.
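A minimal illustration of the failure against the TPC-DS store table (the filter value
is hypothetical):
psql <<'EOF'
-- the CTE exposes the column only under its alias
with tmp as (select s_state as state from store)
select count(*) from tmp where s_state = 'TN'; -- fails: column "s_state" does not exist
select count(*) from tmp where state = 'TN';   -- works: references the alias
EOF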
4. Added table aliases.
This was done on queries: 2, 14, and 23.
5. Added "limit 100" to very large result set queries. For the larger tests (e.g. 15TB), a few of the
TPC-DS queries can output a very large number of rows which are just discarded.
This was done on queries: 64, 34, and 71.