install-hadoop-spark
version5 this document is updated on 2020-1-12 by [email protected]
version6 updated 2020-1-30 by [email protected]
version7 updated 2020-4-1 add HBase content
update 2020-9-29
version8 update 2021-3-31
1. Synchronize the time settings on all nodes before updating the cluster configuration.
2. A fix for Hadoop failing to connect to specific nodes because of the firewall.
******************************
* *
* Install Hadoop *
* *
******************************
Clear the Ubuntu system disk cache:
sudo rm -rf ~/.cache
Ubuntu 64bit 16.04
00. Every PC uses a fixed IP address
In the desktop, top-right corner:
->Edit Connections
->Wired connection 1
->Edit
->IPv4 Settings
->Method: Manual
->Addresses: Add
->edit: address 192.168.10.xxx
netmask 255.255.255.0
Gateway 192.168.10.1
DNS Servers 192.168.10.1
->Save
->Disconnect
->reconnect by clicking Wired connection 1
In a terminal:
run ifconfig to check that the IP address is correct.
0. Update apt sources
Replace the host (cn.archive.ubuntu.com) with mirrors.aliyun.com in /etc/apt/sources.list.
You cannot edit this file directly; copy it to your home directory, edit it there, and use 'sudo' to copy it back to /etc/apt/sources.list.
sudo cp ~/sources.list /etc/apt/sources.list
1.install ssh
sudo apt-get update
sudo apt-get install ssh
{
If a lock error happens, remove the lock files:
sudo rm /var/cache/apt/archives/lock
sudo rm /var/lib/dpkg/lock
}
sudo apt-get install openssh-server
sudo apt-get install rsync
[in user homedir]$ mkdir .ssh
cd .ssh
ssh-keygen -t rsa
press Enter at every prompt
cat ./id_rsa.pub >> ./authorized_keys
ssh localhost
When asked [yes/no], answer yes.
After this, no password is needed.
==========2.oracle jdk ========
1.tar -zxf jdk-8u192-linux-x64.tar.gz
2.sudo mv jdk1.8.0_192 /usr/local
3.nano ~/.bashrc
ADD
export JAVA_HOME=/usr/local/jdk1.8.0_192
export PATH=$PATH:$JAVA_HOME/bin
4.source ~/.bashrc
5.java -version
========== oracle jdk ========
3.install hadoop
mkdir ~/datadir
mkdir ~/namedir
mkdir ~/tempdir
tar -zxvf hadoop-2.9.2.tar.gz
sudo mv hadoop-2.9.2 /usr/local/hadoop
chmod 777 /usr/local/hadoop
4.modify /usr/local/hadoop/etc/hadoop/hadoop-env.sh
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
replace: export JAVA_HOME=${JAVA_HOME}
to: export JAVA_HOME=/usr/local/jdk1.8.0_192
5.add hadoop to path
add following lines into ~/.bashrc
nano ~/.bashrc
export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
source ~/.bashrc
hadoop version
[Standalone mode now works: no NameNode, no DataNode.]
[NOTE!!]
If you want to use multi-node mode, do everything above on every PC.
Then follow the steps in <<<Multi-Node Setup>>>
[Below is for pseudo-distributed mode: NameNode and DataNode on one machine.]
6.0.1 make dir
mkdir ~/tempdir
mkdir ~/namedir
mkdir ~/datadir
6.edit core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tempdir</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ubuntu:9000</value>
</property>
</configuration>
7.edit hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/namedir</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/datadir</value>
</property>
</configuration>
7.1 Edit hadoop-env.sh to increase the heap size
export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"
It seems Hadoop needs at least 1024 MB of heap to process a 128 MB data array.
7.2 slaves
ubuntu
7.3 Edit /etc/hosts on this machine (ubuntu).
sudo nano /etc/hosts
192.168.146.129 ubuntu
8.format (first time)
hadoop namenode -format
9.start
start-dfs.sh
10.web[ http://localhost:50070 ]
11.stop
stop-dfs.sh
If something is wrong, check the logs under /usr/local/hadoop/logs.
************************************************
** **
** Multi-Node Setup **
** **
************************************************
<<<Multi-Node Setup>>>
II Multi-Node Setup
0. Write down every computer's hostname and IPv4 address.
Open a terminal; the hostname appears after hadoop@ in the prompt (hadoop@xxx:),
where xxx is your hostname.
Use ifconfig to see the IPv4 address.
1.get all nodes ip addresses.
2. Edit /etc/hosts on all nodes.
192.168.10.10 hp
192.168.10.11 hadoop-master
192.168.10.12 hadoop-HP-Laptop-14s
192.168.10.13 hadoop-X200-2
192.168.10.14 hadoop-X200
192.168.10.20 hadoop-i3-2
192.168.10.21 hadoop-Lenovo1201
192.168.10.22 hadoop-Lenovo1202
Copy hosts to every PC:
scp /etc/hosts hadoop@hp:hosts.txt
scp /etc/hosts hadoop@hadoop-X200-2:hosts.txt
scp /etc/hosts hadoop@hadoop-X200:hosts.txt
scp /etc/hosts hadoop@hadoop-HP-Laptop-14s:hosts.txt
scp /etc/hosts hadoop@hadoop-i3-2:hosts.txt
scp /etc/hosts hadoop@hadoop-Lenovo1201:hosts.txt
scp /etc/hosts hadoop@hadoop-Lenovo1202:hosts.txt
On every PC, copy hosts.txt to /etc/hosts:
sudo cp hosts.txt /etc/hosts
3. Passwordless login between nodes
If you ran ssh-keygen -t rsa before, skip step 3.1.
For the master node:
(3.1) ssh-keygen -t rsa [press Enter at every prompt; if you already have a key, skip this step]
(3.2) copy master pub key to every PCs.
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hp
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-HP-Laptop-14s
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-X200-2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-X200
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-i3-2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-Lenovo1201
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-Lenovo1202
(3.3) chmod 0600 ~/.ssh/authorized_keys
Use ssh hadoop@hadoop-slave-1 to check whether it succeeded.
Log in to each slave node and do the same as on the master.
For the slave nodes:
(3.1) ssh-keygen -t rsa [press Enter at every prompt; if you already have a key, skip this step]
(3.2) copy the node's public key to every other node.
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hp
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-HP-Laptop-14s
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-X200-2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-X200
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-i3-2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-Lenovo1201
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-Lenovo1202
(3.3) chmod 0600 ~/.ssh/authorized_keys
**************************************
** **
** Architecture of a Hadoop Cluster**
** **
**************************************
A job can be split into tasks (map or reduce).
Hadoop:
{
NameNode (master)
DataNode (slaves or workers).
}
Yarn:
yarn actors:
{
ResourceManager (in this toy cluster, on the master; manages the whole cluster.)
NodeManager (on every worker node; manages the AM and the tasks that run on that node.)
}
AM, container, etc.
1. A job is started by the client.
2. The YARN ResourceManager asks one NodeManager to create an ApplicationMaster (AM) to take charge of the job.
3. The AM creates executors for the tasks.
4. Both the AM and the executors run in containers, which are controlled by the NodeManager.
**************************************
** **
** About memories settings: **
** **
**************************************
About memory settings:
ref: https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/#architecture-of-a-hadoop-cluster
[1]
yarn-site.xml -> yarn.nodemanager.resource.memory-mb decides how much memory in total can be allocated on a single machine for all YARN containers on that machine.
[2]
yarn-site.xml -> yarn.scheduler.maximum-allocation-mb
yarn-site.xml -> yarn.scheduler.minimum-allocation-mb
These two values decide how much memory a single container can consume.
[3]
mapred-site.xml -> yarn.app.mapreduce.am.resource.mb
This decides how much memory the AM can consume. It should be between yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb.
[4]
mapred-site.xml -> mapreduce.map.memory.mb
mapred-site.xml -> mapreduce.reduce.memory.mb
These two values decide how much memory a single map/reduce task can consume.
A map/reduce task runs in one container, so the map/reduce memory should be smaller than the container memory.
Graph:
Machine(worker/node)
------------------------
| NodeManager |
| |
| [container1 run AM] |
| [container2 run map] |
| [container3 run Red] |
| |
------------------------
For a 2 GB memory machine the settings should be [per the ref]:
yarn.nodemanager.resource.memory-mb 1536
yarn.scheduler.maximum-allocation-mb 1536
yarn.scheduler.minimum-allocation-mb 128
yarn.app.mapreduce.am.resource.mb 512
mapreduce.map.memory.mb 256
mapreduce.reduce.memory.mb 256
Use ALT+CTRL+F1-F7 to switch between shell mode and GUI mode.
In shell mode the whole system seems to use only about 400 MB.
The ref values are not good. I use these settings (2 GB):
yarn.nodemanager.resource.memory-mb 1536
yarn.scheduler.maximum-allocation-mb 1536
yarn.scheduler.minimum-allocation-mb 768
yarn.app.mapreduce.am.resource.mb 768
mapreduce.map.memory.mb 768
mapreduce.reduce.memory.mb 768
For a 4 GB machine, I use:
yarn.nodemanager.resource.memory-mb 3072
yarn.scheduler.maximum-allocation-mb 3072
yarn.scheduler.minimum-allocation-mb 768
yarn.app.mapreduce.am.resource.mb 768
mapreduce.map.memory.mb 768
mapreduce.reduce.memory.mb 768
**************************************
** **
** Install Hadoop on master **
** **
**************************************
<<<install hadoop on master>>>
4. Install Hadoop on the master
Make tempdir, namedir, and datadir under /home/hadoop/.
Edit these config files under:
/usr/local/hadoop/etc/hadoop
core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tempdir</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/namedir</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/datadir</value>
</property>
</configuration>
mapred-site.xml (use yarn)
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>700</value>
<description>MR_App_can_use_mem_count</description>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>700</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>700</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx700m</value>
<description>only_java_can_use_mem_count</description>
</property>
</configuration>
Edit yarn-site.xml
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
<description>who_run_resourcemanager</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>NodeManager_can_run_MapReduce</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2800</value>
<description>total_mem_a_node_can_use</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>700</value>
<description>one_task_can_use_max_mem</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>700</value>
<description>one_task_can_use_min_mem</description>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>check_is_pc_use_virtualMem</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
<description>avail_cpu_core_num</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>one_container_min_cpu</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>1</value>
<description>one_container_max_cpu</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
</configuration>
Edit /usr/local/hadoop/etc/hadoop/slaves
hp
hadoop-master
hadoop-HP-Laptop-14s
hadoop-X200-2
hadoop-X200
hadoop-i3-2
hadoop-Lenovo1201
hadoop-Lenovo1202
[Note!!]
In Hadoop 3.0 the slaves file is called workers.
**************************************
** **
** Install Hadoop on Slaves **
** **
**************************************
<<<Install Hadoop on Slaves>>>
5. Copy the configured Hadoop to the slaves
On the master:
tar czf ~/hadoop.tar.gz -C /usr/local hadoop
scp ~/hadoop.tar.gz hadoop-slave-1:~
scp ~/hadoop.tar.gz hadoop-slave-2:~
6. Untar the configured Hadoop on all slaves
tar xzf hadoop.tar.gz
sudo mv hadoop /usr/local/hadoop
In fact, copying only the configuration XML files to the slaves is enough:
cd /usr/local/hadoop/etc
tar czf hadoop-etc.tar.gz hadoop
scp hadoop-etc.tar.gz hadoop@hp:~
scp hadoop-etc.tar.gz hadoop@hadoop-HP-Laptop-14s:~
scp hadoop-etc.tar.gz hadoop@hadoop-X200-2:~
scp hadoop-etc.tar.gz hadoop@hadoop-X200:~
scp hadoop-etc.tar.gz hadoop@hadoop-i3-2:~
scp hadoop-etc.tar.gz hadoop@hadoop-Lenovo1201:~
scp hadoop-etc.tar.gz hadoop@hadoop-Lenovo1202:~
On each slave:
tar xzf hadoop-etc.tar.gz
cp -rf hadoop /usr/local/hadoop/etc
rm ~/hadoop-etc.tar.gz
rm -rf ~/hadoop
For each machine, edit the memory settings:
nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
Check free memory with: free
[NOTE!]
Without the GUI, Ubuntu uses only about 350 MB of memory.
Boot Ubuntu 16.04 without the GUI:
sudo systemctl disable lightdm.service
Reactivate the GUI:
sudo systemctl enable lightdm.service
sudo systemctl start lightdm.service
7. On the master, start the Hadoop cluster
hdfs namenode -format
start-dfs.sh
start-yarn.sh
8. Check the daemons
On the master:
jps
NameNode
ResourceManager
On the slaves:
DataNode
NodeManager
9. Stop Hadoop (run on the master)
stop-yarn.sh
stop-dfs.sh
Shut down a machine:
sudo shutdown -h now
10. DFS WEB Page
http://localhost:50070/explorer.html#/
Yarn WEB Page
http://hadoop-master:8088/cluster
In HDFS, 1 MB is 1024 * 1024 bytes, so 128 MB is 134217728 bytes.
****
uninstall openjdk
sudo apt-get remove openjdk*
install Oracle JDK
1.tar -zxf jdk-8u191-linux-i586.tar.gz
2.sudo mv jdk1.8.0_191 /usr/local
3.nano ~/.bashrc
ADD
export JAVA_HOME=/usr/local/jdk1.8.0_191
export PATH=$PATH:$JAVA_HOME/bin
4.source ~/.bashrc
5.java -version
6. Edit hadoop-env.sh:
export JAVA_HOME=/usr/local/jdk1.8.0_191
7.Edit .bashrc
export HADOOP_COMMON_LIB_NATIVE_DIR=/usr/local/hadoop/lib/native
export HADOOP_OPTS="-Djava.library.path=/usr/local/hadoop/lib/native"
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native
***** DEV-1
To build a MapReduce program, import at least the following libraries (a minimal sketch follows the list):
hadoop-common-2.7.6.jar
hadoop-hdfs-2.7.6.jar
hadoop-mapreduce-client-core-2.7.6.jar
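A minimal sketch of such a program (a word count), assuming these jars are on the classpath; the class names and paths below are placeholders of mine, not from these notes:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
            // emit (word, 1) for every whitespace-separated token in the line
            for (String w : value.toString().split("\\s+")) {
                if (w.isEmpty()) continue;
                word.set(w);
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx) throws IOException, InterruptedException {
            // sum the counts for one word
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. a file on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Build it into a jar and run it with something like: hadoop jar wordcount.jar WordCount /test.txt /wc-out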
***** DEV-2
Java heap error
Fix with:
yarn-site.xml
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4.2</value>
</property>
mapred-site.xml
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2048m</value>
</property>
**** DEV-3
total 6400 MB
hadoop (3 nodes, 1 vcore each) takes 360 seconds
hadoop (3 nodes, 2 vcores each) takes 120 seconds
C++ theoretically takes 6400/60 = 106 seconds
***** MR4C-1
I feel MR4C is a dead end.
=============================
Try more vcores
yarn-site.xml 2800,900,900 vcores 3 2 3
hadoop-env.sh 3000
mapred-site.xml map red java 900mb
try 1
18:13 map 0% red 0%
20:29 map 100% red 100%
try 2
23:10 map 0% red 0%
25:07 map 100% red 100%
try 3
26:48 map 0 red 0
28:59 map 100 red 100
===========================
2018-1-8 4:30
Try more vcores
yarn-site.xml 2800,700,700 vcores 4 2 4
hadoop-env.sh 3000
mapred-site.xml map red java 700mb
try-1
42:00 0 0
44:00 100 100
try-2
45:45 0 0
47:35 100 100
try-3
48:40 0 0
50:19 100 100
total 99 sec
try-4
52:32 0 0
54:08 100 100
total 96 sec
clear caches
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
Try a single C++ program reading 6 GB:
73-75 seconds
=================
2018-1-8 05:30
Try:
1. Count the data in the record reader step.
2. Do not copy the 128 MB of data into a BytesWritable for the mapper.
3. Only send the result ints to the mapper.
(A rough sketch of such a record reader appears after the timing results below.)
try-1
57:34 0 0
59:11 100 100
total 97 seconds
try-2
00:22 0 0
02:21 100 100
try-3
03:40 0 0
05:14 100 100
total 94 seconds
It seems that using BytesWritable in the mapper does not add noticeable time.
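A rough sketch of ideas 1-2 above, assuming a custom InputFormat whose splits are whole 128 MB blocks; the class name and the per-byte aggregation are my own illustration, not the exact code used in these tests:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SummingRecordReader extends RecordReader<NullWritable, LongWritable> {
    private FileSplit split;
    private TaskAttemptContext context;
    private final LongWritable sum = new LongWritable();
    private boolean done = false;

    @Override
    public void initialize(InputSplit s, TaskAttemptContext ctx) {
        this.split = (FileSplit) s;
        this.context = ctx;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (done) return false;
        FileSystem fs = split.getPath().getFileSystem(context.getConfiguration());
        long total = 0;
        byte[] buf = new byte[1 << 20];              // reuse a 1 MB buffer; never build a 128 MB object
        try (FSDataInputStream in = fs.open(split.getPath())) {
            in.seek(split.getStart());
            long remaining = split.getLength();
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) break;
                for (int i = 0; i < n; i++) total += (buf[i] & 0xFF);   // example aggregation over raw bytes
                remaining -= n;
            }
        }
        sum.set(total);
        done = true;                                 // the whole split is one "record"
        return true;
    }

    @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
    @Override public LongWritable getCurrentValue() { return sum; }
    @Override public float getProgress() { return done ? 1.0f : 0.0f; }
    @Override public void close() { }
}

This would be returned from a custom FileInputFormat's createRecordReader(); the mapper then receives one small LongWritable per block instead of a 128 MB BytesWritable.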
**************************************
** **
** Install SPARK **
** **
**************************************
https://www.edureka.co/blog/spark-tutorial/
https://zh.hortonworks.com/tutorial/setting-up-a-spark-development-environment-with-scala/
https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm (install scala in terminal)
1. Install the Java 1.8 SDK or above; this should already have been done for Hadoop.
2.install scala
download scala-2.11.6.tgz
2.1,2.2
tar xvf scala-2.11.6.tgz
sudo mv scala-2.11.6 /usr/local/scala
3.install spark
3.1,3.2
tar xvf spark-2.1.0-bin-hadoop2.7.tgz
sudo mv spark-2.1.0-bin-hadoop2.7 /usr/local/spark
3.3
nano ~/.bashrc
export PATH=$PATH:/usr/local/scala/bin
export PATH=$PATH:/usr/local/spark/bin:/usr/local/spark/sbin
3.4 source ~/.bashrc
3.5 scala -version
3.6 spark-shell
ctrl+c to quit
4.config spark
chmod 777 /usr/local/spark
chmod 777 /usr/local/scala
mkdir /home/hadoop/spark-tmp
cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
cp slaves.template slaves
4.3 edit
nano spark-env.sh
export SCALA_HOME=/usr/local/scala
export JAVA_HOME=/usr/local/jdk1.8.0_192
export SPARK_MASTER_HOST=hadoop-master
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_DIR=/home/hadoop/spark-tmp
export SPARK_EXECUTOR_INSTANCES=4
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_LOCALITY_WAIT=60s
export SPARK_LOCALITY_WAIT_PROCESS=0s
SPARK_LOCAL_HOSTNAME=<data node hostname>
SPARK_LOCAL_IP=<data node ip>
SPARK_LOCALITY_WAIT is the time (in seconds) to wait for a data-local task to start; if the wait exceeds this time, the task is sent to a less-local node to run.
SPARK_LOCALITY_WAIT_PROCESS, I am not very clear about this one. It gives the wait time (in seconds) at the process-local level, for data already cached in an executor.
4.4
nano spark-defaults.conf
spark.master spark://hadoop-master:7077
4.5
nano slaves
hp
hadoop-master
hadoop-HP-Laptop-14s
hadoop-X200-2
hadoop-X200
hadoop-i3-2 20
hadoop-Lenovo1201 21
hadoop-Lenovo1202 22
5 done
start spark:
# auto start spark, not recommended.
/usr/local/spark/sbin/start-all.sh
# manual start master:
/usr/local/spark/sbin/start-master.sh
# manual start worker:
/usr/local/spark/sbin/start-slave.sh -c 1 -h hostname spark://MasterName:7077
check spark website: http://localhost:8080/
stop spark:
/usr/local/spark/sbin/stop-all.sh
stop-yarn.sh
stop-dfs.sh
===spark examples:
spark-shell
val textfile=sc.textFile("hdfs://hadoop-master:9000/test.txt")
textfile.count()
val words=textfile.flatMap(line=>line.split(" "))
words.count()
words.first()
val bin1=sc.binaryRecords("hdfs://hadoop-master:9000/text.txt",1)
bin1.first()
val bin2=bin1.map(x=>x)
val allfiles=sc.binaryRecords("hdfs://hadoop-master:9000/demo128",2)
allfiles.count()
takes almost 180-240 seconds
Try the read speed of a one-block file.
val onefile=sc.binaryRecords("hdfs://hadoop-master:9000/demo128/image_0_0_4096_4096_4_i16_envi",2)
onefile.count()
takes 21 seconds ??? so long!
Try a one-block file with a single record.
val file2=sc.binaryRecords("hdfs://hadoop-master:9000/demo128/image_0_0_4096_4096_4_i16_envi",134217728)
file2.count()
3-4 seconds !! nice!
Try all files with one record per block. Config: 2 executors with 1 core.
val all2=sc.binaryRecords("hdfs://hadoop-master:9000/demo128",134217728)
all2.count()
6.9 GB (54*128 MB) in total takes 53 seconds ! :)
Try all files with one record per block. Config: 2 workers, 2 executors with 2 cores.
val all2=sc.binaryRecords("hdfs://hadoop-master:9000/demo128",134217728)
all2.count()
6.9 GB (54*128 MB) in total takes 90 seconds ! :)
worker  wmem  exec  core  seconds
config 2 3 2 2 bad
config -- 2 2 1 2.1 minutes, mainly slowed by the 178 node
config -- 1,2,2 2 1 the master has too many tasks to handle.
The 178 node is both a worker and the master; it slows down the whole run.
config -- 0,2,2 2 1 59s, but the master still has 4 executors.
only 2 workers -- 2,2 2 1 72s
slaves 2, master 1   2 exec  1 core   53s
slaves 2, master 1   2 exec  1 core   47s
slaves 2, master 1   8 exec  1 core   53s  (it seems the executor-number setting does not work for spark-shell without YARN)
slaves 2, no master  2 exec  1 core   72s
Attention: I did not add hadoop-master to slaves, but it is still in the worker list.
Attention: it seems the cluster must be restarted after modifying spark-env.sh and slaves.
Next steps:
0. A simple Java program for Spark.
http://spark.praveendeshmane.co.in/spark/spark-wordcount-java-example.jsp
1. Write a simple Java program for Spark that reads an HDFS binary file.
2. Use the filename as the key. https://stackoverflow.com/questions/29686573/spark-obtaining-file-name-in-rdds
3. Read multiple files. https://www.tutorialkart.com/apache-spark/read-multiple-text-files-to-single-rdd/
(A rough Java sketch covering these points follows this list.)
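A rough Java sketch of items 1-3, assuming Spark 2.x; the class name, the paths and the byte-average computation are my own placeholders, not code from these notes:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

public class SparkBinaryReadDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkBinaryReadDemo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // One byte[] per 134217728-byte (128 MB) record, matching the spark-shell tests above.
        JavaRDD<byte[]> blocks = sc.binaryRecords("hdfs://hadoop-master:9000/demo128", 134217728);
        long blockCount = blocks.count();

        // Sum the byte values inside each block, then combine, to get a global average byte value.
        JavaRDD<Long> sums = blocks.map(bytes -> {
            long s = 0;
            for (byte b : bytes) s += (b & 0xFF);
            return s;
        });
        long totalSum = sums.reduce(Long::sum);
        double avg = (double) totalSum / (blockCount * 134217728.0);
        System.out.println("average byte value = " + avg);

        // binaryFiles() keys each whole file by its path, which covers "use filename as key".
        JavaPairRDD<String, PortableDataStream> byName =
                sc.binaryFiles("hdfs://hadoop-master:9000/demo128");
        byName.keys().collect().forEach(System.out::println);

        sc.stop();
    }
}

Package it into a jar and submit it with spark-submit, as in the next lines.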
Run the jar on the cluster:
spark-submit --class SparkDemoOne --master spark://hadoop-master:7077 SparkDemoOne.jar
2 workers, not including jar transfer time: 75 s.
3 workers, not including jar transfer time: __75 s.
3 workers, not including jar transfer time: __70 s.
4 workers, not including jar transfer time: __ s.
Computing the average of 6.7 GB of data:
C++ single thread takes 75 seconds with about 100 MB/s IO.
Hadoop-YARN with 3 slave nodes takes 95 seconds.
Apache Spark with 3 workers takes 75 seconds, not including jar copying.
**************************************
** **
** HDFS Balance **
** **
**************************************
Add the following properties to hdfs-site.xml:
<property>
<name>dfs.balancer.max-size-to-move</name>
<value>134217728</value>
</property>
<property>
<name>dfs.datanode.fsdataset.volume.choosing.policy</name>
<value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.disk.balancer.enabled</name>
<value>true</value>
</property>
After start-dfs.sh, run:
hdfs balancer -policy datanode -threshold 1.0
====Set Locality Level
In spark-env.sh on each node, set that node's own name:
SPARK_LOCAL_HOSTNAME=<data node hostname>
SPARK_LOCAL_IP=<data node ip>
You can configure the wait time before moving to other locality levels using:
spark.locality.wait set to a big value
This is not working; much of the data still runs on a different node (22-32 seconds).
Try something for PROCESS_LOCAL next.
Add to spark-env.sh:
export SPARK_LOCALITY_WAIT=60s
export SPARK_LOCALITY_WAIT_PROCESS=1s
Start the workers manually
1.
/usr/local/spark/sbin/start-master.sh
2.
On each worker, run:
ssh hadoop@hadoop-slave-1
/usr/local/spark/sbin/start-slave.sh -h hadoop-slave-1 spark://hadoop-master:7077
ssh hadoop@hadoop-slave-2
/usr/local/spark/sbin/start-slave.sh -h hadoop-slave-2 spark://hadoop-master:7077
ssh hadoop@hp
/usr/local/spark/sbin/start-slave.sh -h hp spark://hadoop-master:7077
===try1
hp with 4 cores is very slow (10 s per task; 11 GB takes 2 min in total), so I set it to 2 cores and try again.
ssh hadoop@hp
/usr/local/spark/sbin/start-slave.sh -h hp -c 2 spark://hadoop-master:7077
1.5 min
===try2 each worker has 1 core
/usr/local/spark/sbin/start-slave.sh -h hp -c 1 spark://hadoop-master:7077
/usr/local/spark/sbin/start-slave.sh -h hadoop-slave-1 -c 1 spark://hadoop-master:7077
/usr/local/spark/sbin/start-slave.sh -h hadoop-slave-2 -c 1 spark://hadoop-master:7077
total, including jar copying: 1.2 min ~ 72 s
===try3 add the master as a partial worker
/usr/local/spark/sbin/start-slave.sh -h hadoop-master -c 1 spark://hadoop-master:7077
total: 57 s
===try4 4 workers (1 core each), 215*128 MB = 26.875 GB
total: 2.2 min = 132 seconds
*********************************
* *
* Try Seqfile Read performance *
* sequence read *
* 2020-1-18 *
*********************************
      ReadType   dura(sec)   spd(MB/s)
SSD   Key,Val     16.721      248.13
      Only-Key    14.527      285.61
      Key,Val     15.369      269.96
      Only-Key    14.484      286.45
      Key,Val     15.202      272.92
      Only-Key    14.567      284.82
      Key,Val     15.663      264.89
      Only-Key    14.741      281.46
DISK  Key,Val     76.547       54.20
      Only-Key    71.993       57.63
      Key,Val     77.488       53.54
      Only-Key    73.136       56.73
      Key,Val     77.317       53.66
      Only-Key    73.527       56.43
Read a 4148.2 MB seqfile in total. Block size 128 MB, each record 14.4 MB.
Record count: 288.
In VMware, the SSD read speed is about 270 MB/s and the disk read speed about 56 MB/s.
Reading only keys takes 1-2 seconds less on SSD and 4-5 seconds less on disk.
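A small sketch of the two read modes measured above ("Key,Val" vs "Only-Key"), assuming a plain SequenceFile on HDFS; the class name and file path are placeholders of mine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileReadModes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("hdfs://hadoop-master:9000/demo.seq");   // placeholder path

        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

            boolean onlyKeys = true;   // flip to false for the "Key,Val" measurement
            long records = 0;
            while (reader.next(key)) {                      // reads the key; the value is skipped unless asked for
                if (!onlyKeys) reader.getCurrentValue(val); // explicitly deserialize the value
                records++;
            }
            System.out.println("records read: " + records);
        }
    }
}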
****************************
*
* Try Seqfile Read performance
* random read with record interval x.
* 2020-1-18
****************************
               ReadType   dura(sec)   spd(MB/s)
DISK inter2    KV          49          42.33
               K           44          47.14
               KV          46.5        44.60
               K           44.1        47.03
DISK inter4    KV          25.3        40.99
               K           21.8        47.57
               KV          24.3        42.68
               K           22.3        46.50
DISK inter8    KV           1.04      498.58
               K            0.76      682.27
               KV           1.36      381.27
               K            0.79      656.36
DISK inter16   KV           0.66      392.82
               K            0.65      398.87
               KV           0.71      365.16
               K            0.46      563.61
DISK inter     KV           0.16      180.04
               K            0.08      360.09