install-hadoop-spark
version5 this document is updated on 2020-1-12 by [email protected]
version6 updated 2020-1-30 by [email protected]
version7 updated 2020-4-1 add HBase content
update 2020-9-29
version8 update 2021-3-31
1. Synchronize the time settings on all nodes before updating the cluster configuration.
2. A fix for Hadoop failing to connect to specific nodes because of the firewall.
******************************
* *
* Install Hadoop *
* *
******************************
Clear the Ubuntu system disk cache:
sudo rm -rf ~/.cache
Ubuntu 64bit 16.04
00. Every PC uses a fixed IP address
In the desktop, top-right corner:
->Edit Connections
->Wired connection 1
->Edit
->IPv4 Settings
->Method: Manual
->Addresses: Add
->edit: address 192.168.10.xxx
netmask 255.255.255.0
Gateway 192.168.10.1
DNS Servers 192.168.10.1
->Save
->Disconnect
->reconnect by clicking Wired connection 1
In a terminal:
run ifconfig to check that the IP address is correct.
0. Update apt sources
Replace the host (cn.archive.ubuntu.com) with mirrors.aliyun.com in /etc/apt/sources.list.
You cannot edit this file directly; copy it to your home directory, edit it there, and use 'sudo' to copy it back to /etc/apt/sources.list.
sudo cp ~/sources.list /etc/apt/sources.list
1.install ssh
sudo apt-get update
sudo apt-get install ssh
{
If a lock error happens, remove the lock files:
sudo rm /var/cache/apt/archives/lock
sudo rm /var/lib/dpkg/lock
}
sudo apt-get install openssh-server
sudo apt-get install rsync
[in user homedir]$ mkdir .ssh
cd .ssh
ssh-keygen -t rsa
press Enter at every prompt
cat ./id_rsa.pub >> ./authorized_keys
ssh localhost
When asked [yes/no], answer yes.
After this, no password is needed.
==========2.oracle jdk ========
1.tar -zxf jdk-8u192-linux-x64.tar.gz
2.sudo mv jdk1.8.0_192 /usr/local
3.nano ~/.bashrc
ADD
export JAVA_HOME=/usr/local/jdk1.8.0_192
export PATH=$PATH:$JAVA_HOME/bin
4.source ~/.bashrc
5.java -version
========== oracle jdk ========
3.install hadoop
mkdir ~/datadir
mkdir ~/namedir
mkdir ~/tempdir
tar -zxvf hadoop-2.9.2.tar.gz
sudo mv hadoop-2.9.2 /usr/local/hadoop
chmod 777 /usr/local/hadoop
4.modify /usr/local/hadoop/etc/hadoop/hadoop-env.sh
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
replace: export JAVA_HOME=${JAVA_HOME}
to: export JAVA_HOME=/usr/local/jdk1.8.0_192
5.add hadoop to path
add following lines into ~/.bashrc
nano ~/.bashrc
export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
source ~/.bashrc
hadoop version
[Standalone mode now works: no NameNode, no DataNode.]
[NOTE!!]
If you want to use multi-node mode, do everything above on every PC.
Then follow the steps in <<<Multi-Node Setup>>>
[Below is for pseudo-distributed mode: NameNode and DataNode on one machine.]
6.0.1 make dir
mkdir ~/tempdir
mkdir ~/namedir
mkdir ~/datadir
6.edit core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tempdir</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ubuntu:9000</value>
</property>
</configuration>
7.edit hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/namedir</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/datadir</value>
</property>
</configuration>
7.1 Edit hadoop-env.sh to increase the heap size
export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"
It seems Hadoop needs at least 1024 MB of heap to process a 128 MB data array.
7.2 slaves
ubuntu
7.3 Edit /etc/hosts on this machine (ubuntu).
sudo nano /etc/hosts
192.168.146.129 ubuntu
8.format (first time)
hadoop namenode -format
9.start
start-dfs.sh
10.web[ http://localhost:50070 ]
11.stop
stop-dfs.sh
If something is wrong, check the logs under /usr/local/hadoop/logs.
************************************************
** **
** Multi-Node Setup **
** **
************************************************
<<<Multi-Node Setup>>>
II Multi-Node Setup
0. Write down every computer's hostname and IPv4 address.
Open a terminal; the hostname appears after hadoop@ in the prompt (hadoop@xxx:),
where xxx is your hostname.
Use ifconfig to see the IPv4 address.
1.get all nodes ip addresses.
2. Edit /etc/hosts on all nodes.
192.168.10.10 hp
192.168.10.11 hadoop-master
192.168.10.12 hadoop-HP-Laptop-14s
192.168.10.13 hadoop-X200-2
192.168.10.14 hadoop-X200
192.168.10.20 hadoop-i3-2
192.168.10.21 hadoop-Lenovo1201
192.168.10.22 hadoop-Lenovo1202
Copy hosts to every PC:
scp /etc/hosts hadoop@hp:hosts.txt
scp /etc/hosts hadoop@hadoop-X200-2:hosts.txt
scp /etc/hosts hadoop@hadoop-X200:hosts.txt
scp /etc/hosts hadoop@hadoop-HP-Laptop-14s:hosts.txt
scp /etc/hosts hadoop@hadoop-i3-2:hosts.txt
scp /etc/hosts hadoop@hadoop-Lenovo1201:hosts.txt
scp /etc/hosts hadoop@hadoop-Lenovo1202:hosts.txt
On every PC, copy hosts.txt to /etc/hosts:
sudo cp hosts.txt /etc/hosts
3. Passwordless login between nodes
If you ran ssh-keygen -t rsa before, skip step 3.1.
For the master node:
(3.1) ssh-keygen -t rsa [press Enter at every prompt; if you already have a key, skip this step]
(3.2) copy master pub key to every PCs.
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hp
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-HP-Laptop-14s
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-X200-2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-X200
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-i3-2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-Lenovo1201
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-Lenovo1202
(3.3) chmod 0600 ~/.ssh/authorized_keys
Use ssh hadoop@hadoop-slave-1 to check whether it succeeded.
Log in to each slave node and do the same as on the master.
For the slave nodes:
(3.1) ssh-keygen -t rsa [press Enter at every prompt; if you already have a key, skip this step]
(3.2) copy the node's public key to every other node.
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hp
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-HP-Laptop-14s
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-X200-2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-X200
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-i3-2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-Lenovo1201
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-Lenovo1202
(3.3) chmod 0600 ~/.ssh/authorized_keys
**************************************
** **
** Architecture of a Hadoop Cluster**
** **
**************************************
A job can be split into tasks (map or reduce).
Hadoop:
{
NameNode (master)
DataNode (slaves or workers).
}
Yarn:
yarn actors:
{
ResourceManager (in this toy cluster, on the master; manages the whole cluster.)
NodeManager (on every worker node; manages the AM and the tasks that run on that node.)
}
AM, container, etc.
1. A job is started by the client.
2. The YARN ResourceManager asks one NodeManager to create an ApplicationMaster (AM) to take charge of the job.
3. The AM creates executors for the tasks.
4. Both the AM and the executors run in containers, which are controlled by the NodeManager.
**************************************
** **
** About memories settings: **
** **
**************************************
About memory settings:
ref: https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/#architecture-of-a-hadoop-cluster
[1]
yarn-site.xml -> yarn.nodemanager.resource.memory-mb decides how much memory in total can be allocated on a single machine for all YARN containers on that machine.
[2]
yarn-site.xml -> yarn.scheduler.maximum-allocation-mb
yarn-site.xml -> yarn.scheduler.minimum-allocation-mb
These two values decide how much memory a single container can consume.
[3]
mapred-site.xml -> yarn.app.mapreduce.am.resource.mb
This decides how much memory the AM can consume. It should be between yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb.
[4]
mapred-site.xml -> mapreduce.map.memory.mb
mapred-site.xml -> mapreduce.reduce.memory.mb
These two values decide how much memory a single map/reduce task can consume.
A map/reduce task runs in one container, so the map/reduce memory should be smaller than the container memory.
Graph:
Machine(worker/node)
------------------------
| NodeManager |
| |
| [container1 run AM] |
| [container2 run map] |
| [container3 run Red] |
| |
------------------------
For a 2 GB memory machine the settings should be [per the ref]:
yarn.nodemanager.resource.memory-mb 1536
yarn.scheduler.maximum-allocation-mb 1536
yarn.scheduler.minimum-allocation-mb 128
yarn.app.mapreduce.am.resource.mb 512
mapreduce.map.memory.mb 256
mapreduce.reduce.memory.mb 256
Use ALT+CTRL+F1-F7 to switch between shell mode and GUI mode.
In shell mode the whole system seems to use only about 400 MB.
The ref values are not good. I use these settings (2 GB):
yarn.nodemanager.resource.memory-mb 1536
yarn.scheduler.maximum-allocation-mb 1536
yarn.scheduler.minimum-allocation-mb 768
yarn.app.mapreduce.am.resource.mb 768
mapreduce.map.memory.mb 768
mapreduce.reduce.memory.mb 768
For a 4 GB machine, I use:
yarn.nodemanager.resource.memory-mb 3072
yarn.scheduler.maximum-allocation-mb 3072
yarn.scheduler.minimum-allocation-mb 768
yarn.app.mapreduce.am.resource.mb 768
mapreduce.map.memory.mb 768
mapreduce.reduce.memory.mb 768
**************************************
** **
** Install Hadoop on master **
** **
**************************************
<<<install hadoop on master>>>
4. Install Hadoop on the master
Make tempdir, namedir, and datadir under /home/hadoop/.
Edit these config files under:
/usr/local/hadoop/etc/hadoop
core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tempdir</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/namedir</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/datadir</value>
</property>
</configuration>
mapred-site.xml (use yarn)
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>700</value>
<description>MR_App_can_use_mem_count</description>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>700</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>700</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx700m</value>
<description>only_java_can_use_mem_count</description>
</property>
</configuration>
Edit yarn-site.xml
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
<description>who_run_resourcemanager</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>NodeManager_can_run_MapReduce</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2800</value>
<description>total_mem_a_node_can_use</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>700</value>
<description>one_task_can_use_max_mem</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>700</value>
<description>one_task_can_use_min_mem</description>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>check_is_pc_use_virtualMem</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
<description>avail_cpu_core_num</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>one_container_min_cpu</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>1</value>
<description>one_container_max_cpu</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
</configuration>
Edit /usr/local/hadoop/etc/hadoop/slaves
hp
hadoop-master
hadoop-HP-Laptop-14s
hadoop-X200-2
hadoop-X200
hadoop-i3-2
hadoop-Lenovo1201
hadoop-Lenovo1202
[Note!!]
In Hadoop 3.0 the slaves file is called workers.
**************************************
** **
** Install Hadoop on Slaves **
** **
**************************************
<<<Install Hadoop on Slaves>>>
5. Copy the configured Hadoop to the slaves
On the master:
tar czf ~/hadoop.tar.gz -C /usr/local hadoop
scp ~/hadoop.tar.gz hadoop-slave-1:~
scp ~/hadoop.tar.gz hadoop-slave-2:~
6. Untar the configured Hadoop on all slaves
tar xzf hadoop.tar.gz
sudo mv hadoop /usr/local/hadoop
In fact, copying only the configuration XML files to the slaves is enough:
cd /usr/local/hadoop/etc
tar czf hadoop-etc.tar.gz hadoop
scp hadoop-etc.tar.gz hadoop@hp:~
scp hadoop-etc.tar.gz hadoop@hadoop-HP-Laptop-14s:~
scp hadoop-etc.tar.gz hadoop@hadoop-X200-2:~
scp hadoop-etc.tar.gz hadoop@hadoop-X200:~
scp hadoop-etc.tar.gz hadoop@hadoop-i3-2:~
scp hadoop-etc.tar.gz hadoop@hadoop-Lenovo1201:~
scp hadoop-etc.tar.gz hadoop@hadoop-Lenovo1202:~
On each slave:
tar xzf hadoop-etc.tar.gz
cp -rf hadoop /usr/local/hadoop/etc
rm ~/hadoop-etc.tar.gz
rm -rf ~/hadoop
For each machine, edit the memory settings:
nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
Check free memory with: free
[NOTE!]
Without the GUI, Ubuntu uses only about 350 MB of memory.
Boot Ubuntu 16.04 without the GUI:
sudo systemctl disable lightdm.service
Reactivate the GUI:
sudo systemctl enable lightdm.service
sudo systemctl start lightdm.service
7. On the master, start the Hadoop cluster
hdfs namenode -format
start-dfs.sh
start-yarn.sh
8. Check the daemons
On the master:
jps
NameNode
ResourceManager
On the slaves:
DataNode
NodeManager
9. Stop Hadoop (run on the master)
stop-yarn.sh
stop-dfs.sh
Shut down a machine:
sudo shutdown -h now
10. DFS WEB Page
http://localhost:50070/explorer.html#/
Yarn WEB Page
http://hadoop-master:8088/cluster
In HDFS, 1 MB is 1024 * 1024 bytes, so 128 MB is 134217728 bytes.
****
uninstall openjdk
sudo apt-get remove openjdk*
install Oracle JDK
1.tar -zxf jdk-8u191-linux-i586.tar.gz
2.sudo mv jdk1.8.0_191 /usr/local
3.nano ~/.bashrc
ADD
export JAVA_HOME=/usr/local/jdk1.8.0_191
export PATH=$PATH:$JAVA_HOME/bin
4.source ~/.bashrc
5.java -version
6. Edit hadoop-env.sh:
export JAVA_HOME=/usr/local/jdk1.8.0_191
7.Edit .bashrc
export HADOOP_COMMON_LIB_NATIVE_DIR=/usr/local/hadoop/lib/native
export HADOOP_OPTS="-Djava.library.path=/usr/local/hadoop/lib/native"
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native
***** DEV-1
To build a MapReduce program, import at least the following libraries (a minimal sketch follows the list):
hadoop-common-2.7.6.jar
hadoop-hdfs-2.7.6.jar
hadoop-mapreduce-client-core-2.7.6.jar
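A minimal sketch of such a program (a word count), assuming these jars are on the classpath; the class names and paths below are placeholders of mine, not from these notes:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
            // emit (word, 1) for every whitespace-separated token in the line
            for (String w : value.toString().split("\\s+")) {
                if (w.isEmpty()) continue;
                word.set(w);
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx) throws IOException, InterruptedException {
            // sum the counts for one word
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. a file on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Build it into a jar and run it with something like: hadoop jar wordcount.jar WordCount /test.txt /wc-out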
***** DEV-2
Java heap error
Fix with:
yarn-site.xml
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4.2</value>
</property>
mapred-site.xml
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2048m</value>
</property>
**** DEV-3
total 6400 MB
hadoop (3 nodes, 1 vcore each) takes 360 seconds
hadoop (3 nodes, 2 vcores each) takes 120 seconds
C++ theoretically takes 6400/60 = 106 seconds
***** MR4C-1
I feel MR4C is a dead end.
=============================
Try more vcores
yarn-site.xml 2800,900,900 vcores 3 2 3
hadoop-env.sh 3000
mapred-site.xml map red java 900mb
try 1
18:13 map 0% red 0%
20:29 map 100% red 100%
try 2
23:10 map 0% red 0%
25:07 map 100% red 100%
try 3
26:48 map 0 red 0
28:59 map 100 red 100
===========================
2018-1-8 4:30
Try more vcores
yarn-site.xml 2800,700,700 vcores 4 2 4
hadoop-env.sh 3000
mapred-site.xml map red java 700mb
try-1
42:00 0 0
44:00 100 100
try-2
45:45 0 0
47:35 100 100
try-3
48:40 0 0
50:19 100 100
total 99 sec
try-4
52:32 0 0
54:08 100 100
total 96 sec
clear caches
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
Try a single C++ program reading 6 GB:
73-75 seconds
=================
2018-1-8 05:30
Try:
1. Count the data in the record reader step.
2. Do not copy the 128 MB of data into a BytesWritable for the mapper.
3. Only send the result ints to the mapper.
(A rough sketch of such a record reader appears after the timing results below.)
try-1
57:34 0 0
59:11 100 100
total 97 seconds
try-2
00:22 0 0
02:21 100 100
try-3
03:40 0 0
05:14 100 100
total 94 seconds
It seems that using BytesWritable in the mapper does not add noticeable time.
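A rough sketch of ideas 1-2 above, assuming a custom InputFormat whose splits are whole 128 MB blocks; the class name and the per-byte aggregation are my own illustration, not the exact code used in these tests:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SummingRecordReader extends RecordReader<NullWritable, LongWritable> {
    private FileSplit split;
    private TaskAttemptContext context;
    private final LongWritable sum = new LongWritable();
    private boolean done = false;

    @Override
    public void initialize(InputSplit s, TaskAttemptContext ctx) {
        this.split = (FileSplit) s;
        this.context = ctx;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (done) return false;
        FileSystem fs = split.getPath().getFileSystem(context.getConfiguration());
        long total = 0;
        byte[] buf = new byte[1 << 20];              // reuse a 1 MB buffer; never build a 128 MB object
        try (FSDataInputStream in = fs.open(split.getPath())) {
            in.seek(split.getStart());
            long remaining = split.getLength();
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) break;
                for (int i = 0; i < n; i++) total += (buf[i] & 0xFF);   // example aggregation over raw bytes
                remaining -= n;
            }
        }
        sum.set(total);
        done = true;                                 // the whole split is one "record"
        return true;
    }

    @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
    @Override public LongWritable getCurrentValue() { return sum; }
    @Override public float getProgress() { return done ? 1.0f : 0.0f; }
    @Override public void close() { }
}

This would be returned from a custom FileInputFormat's createRecordReader(); the mapper then receives one small LongWritable per block instead of a 128 MB BytesWritable.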
**************************************
** **
** Install SPARK **
** **
**************************************
https://www.edureka.co/blog/spark-tutorial/
https://zh.hortonworks.com/tutorial/setting-up-a-spark-development-environment-with-scala/
https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm (install scala in terminal)
1. Install the Java 1.8 SDK or above; this should already have been done for Hadoop.
2.install scala
download scala-2.11.6.tgz
2.1,2.2
tar xvf scala-2.11.6.tgz
sudo mv scala-2.11.6 /usr/local/scala
3.install spark
3.1,3.2
tar xvf spark-2.1.0-bin-hadoop2.7.tgz
sudo mv spark-2.1.0-bin-hadoop2.7 /usr/local/spark
3.3
nano ~/.bashrc
export PATH=$PATH:/usr/local/scala/bin
export PATH=$PATH:/usr/local/spark/bin:/usr/local/spark/sbin
3.4 source ~/.bashrc
3.5 scala -version
3.6 spark-shell
ctrl+c to quit
4.config spark
chmod 777 /usr/local/spark
chmod 777 /usr/local/scala
mkdir /home/hadoop/spark-tmp
cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
cp slaves.template slaves
4.3 edit
nano spark-env.sh
export SCALA_HOME=/usr/local/scala
export JAVA_HOME=/usr/local/jdk1.8.0_192
export SPARK_MASTER_HOST=hadoop-master
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_DIR=/home/hadoop/spark-tmp
export SPARK_EXECUTOR_INSTANCES=4
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_LOCALITY_WAIT=60s
export SPARK_LOCALITY_WAIT_PROCESS=0s
SPARK_LOCAL_HOSTNAME=<data node hostname>
SPARK_LOCAL_IP=<data node ip>
SPARK_LOCALITY_WAIT is the time (in seconds) to wait for a data-local task to start; if the wait exceeds this time, the task is sent to a less-local node to run.
SPARK_LOCALITY_WAIT_PROCESS, I am not very clear about this one. It gives the wait time (in seconds) at the process-local level, for data already cached in an executor.
4.4
nano spark-defaults.conf
spark.master spark://hadoop-master:7077
4.5
nano slaves
hp
hadoop-master
hadoop-HP-Laptop-14s
hadoop-X200-2
hadoop-X200
hadoop-i3-2 20
hadoop-Lenovo1201 21
hadoop-Lenovo1202 22
5 done
start spark:
# auto start spark, not recommended.
/usr/local/spark/sbin/start-all.sh
# manual start master:
/usr/local/spark/sbin/start-master.sh
# manual start worker:
/usr/local/spark/sbin/start-slave.sh -c 1 -h hostname spark://MasterName:7077
check spark website: http://localhost:8080/
stop spark:
/usr/local/spark/sbin/stop-all.sh
stop-yarn.sh
stop-dfs.sh
===spark examples:
spark-shell
val textfile=sc.textFile("hdfs://hadoop-master:9000/test.txt")
textfile.count()
val words=textfile.flatMap(line=>line.split(" "))
words.count()
words.first()
val bin1=sc.binaryRecords("hdfs://hadoop-master:9000/text.txt",1)
bin1.first()
val bin2=bin1.map(x=>x)
val allfiles=sc.binaryRecords("hdfs://hadoop-master:9000/demo128",2)
allfiles.count()
takes almost 180-240 seconds
Try the read speed of a one-block file.
val onefile=sc.binaryRecords("hdfs://hadoop-master:9000/demo128/image_0_0_4096_4096_4_i16_envi",2)
onefile.count()
takes 21 seconds ??? so long!
Try a one-block file with a single record.
val file2=sc.binaryRecords("hdfs://hadoop-master:9000/demo128/image_0_0_4096_4096_4_i16_envi",134217728)
file2.count()
3-4 seconds !! nice!
Try all files with one record per block. Config: 2 executors with 1 core.
val all2=sc.binaryRecords("hdfs://hadoop-master:9000/demo128",134217728)
all2.count()
6.9 GB (54*128 MB) in total takes 53 seconds ! :)
Try all files with one record per block. Config: 2 workers, 2 executors with 2 cores.
val all2=sc.binaryRecords("hdfs://hadoop-master:9000/demo128",134217728)
all2.count()
6.9 GB (54*128 MB) in total takes 90 seconds ! :)
worker  wmem  exec  core  seconds
config 2 3 2 2 bad
config -- 2 2 1 2.1 minutes, mainly slowed by the 178 node
config -- 1,2,2 2 1 the master has too many tasks to handle.
The 178 node is both a worker and the master; it slows down the whole run.
config -- 0,2,2 2 1 59s, but the master still has 4 executors.
only 2 workers -- 2,2 2 1 72s
slaves 2, master 1   2 exec  1 core   53s
slaves 2, master 1   2 exec  1 core   47s
slaves 2, master 1   8 exec  1 core   53s  (it seems the executor-number setting does not work for spark-shell without YARN)
slaves 2, no master  2 exec  1 core   72s
Attention: I did not add hadoop-master to slaves, but it is still in the worker list.
Attention: it seems the cluster must be restarted after modifying spark-env.sh and slaves.
Next steps:
0. A simple Java program for Spark.
http://spark.praveendeshmane.co.in/spark/spark-wordcount-java-example.jsp
1. Write a simple Java program for Spark that reads an HDFS binary file.
2. Use the filename as the key. https://stackoverflow.com/questions/29686573/spark-obtaining-file-name-in-rdds
3. Read multiple files. https://www.tutorialkart.com/apache-spark/read-multiple-text-files-to-single-rdd/
(A rough Java sketch covering these points follows this list.)
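A rough Java sketch of items 1-3, assuming Spark 2.x; the class name, the paths and the byte-average computation are my own placeholders, not code from these notes:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

public class SparkBinaryReadDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkBinaryReadDemo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // One byte[] per 134217728-byte (128 MB) record, matching the spark-shell tests above.
        JavaRDD<byte[]> blocks = sc.binaryRecords("hdfs://hadoop-master:9000/demo128", 134217728);
        long blockCount = blocks.count();

        // Sum the byte values inside each block, then combine, to get a global average byte value.
        JavaRDD<Long> sums = blocks.map(bytes -> {
            long s = 0;
            for (byte b : bytes) s += (b & 0xFF);
            return s;
        });
        long totalSum = sums.reduce(Long::sum);
        double avg = (double) totalSum / (blockCount * 134217728.0);
        System.out.println("average byte value = " + avg);

        // binaryFiles() keys each whole file by its path, which covers "use filename as key".
        JavaPairRDD<String, PortableDataStream> byName =
                sc.binaryFiles("hdfs://hadoop-master:9000/demo128");
        byName.keys().collect().forEach(System.out::println);

        sc.stop();
    }
}

Package it into a jar and submit it with spark-submit, as in the next lines.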
Run the jar on the cluster:
spark-submit --class SparkDemoOne --master spark://hadoop-master:7077 SparkDemoOne.jar
2 workers, not including jar transfer time: 75 s.
3 workers, not including jar transfer time: __75 s.
3 workers, not including jar transfer time: __70 s.
4 workers, not including jar transfer time: __ s.
Computing the average of 6.7 GB of data:
C++ single thread takes 75 seconds with about 100 MB/s IO.
Hadoop-YARN with 3 slave nodes takes 95 seconds.
Apache Spark with 3 workers takes 75 seconds, not including jar copying.
**************************************
** **
** HDFS Balance **
** **
**************************************
Add the following properties to hdfs-site.xml:
<property>
<name>dfs.balancer.max-size-to-move</name>
<value>134217728</value>
</property>
<property>
<name>dfs.datanode.fsdataset.volume.choosing.policy</name>
<value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.disk.balancer.enabled</name>
<value>true</value>
</property>
After start-dfs.sh, run:
hdfs balancer -policy datanode -threshold 1.0
====Set Locality Level
In spark-env.sh on each node, set that node's own name:
SPARK_LOCAL_HOSTNAME=<data node hostname>
SPARK_LOCAL_IP=<data node ip>
You can configure the wait time before moving to other locality levels using:
spark.locality.wait set to a big value
This is not working; much of the data still runs on a different node (22-32 seconds).
Try something for PROCESS_LOCAL next.
Add to spark-env.sh:
export SPARK_LOCALITY_WAIT=60s
export SPARK_LOCALITY_WAIT_PROCESS=1s
Start the workers manually
1.
/usr/local/spark/sbin/start-master.sh
2.
On each worker, run:
ssh hadoop@hadoop-slave-1
/usr/local/spark/sbin/start-slave.sh -h hadoop-slave-1 spark://hadoop-master:7077
ssh hadoop@hadoop-slave-2
/usr/local/spark/sbin/start-slave.sh -h hadoop-slave-2 spark://hadoop-master:7077
ssh hadoop@hp
/usr/local/spark/sbin/start-slave.sh -h hp spark://hadoop-master:7077
===try1
hp with 4 cores is very slow (10 s per task; 11 GB takes 2 min in total), so I set it to 2 cores and try again.
ssh hadoop@hp
/usr/local/spark/sbin/start-slave.sh -h hp -c 2 spark://hadoop-master:7077
1.5 min
===try2 each worker has 1 core
/usr/local/spark/sbin/start-slave.sh -h hp -c 1 spark://hadoop-master:7077
/usr/local/spark/sbin/start-slave.sh -h hadoop-slave-1 -c 1 spark://hadoop-master:7077
/usr/local/spark/sbin/start-slave.sh -h hadoop-slave-2 -c 1 spark://hadoop-master:7077
total, including jar copying: 1.2 min ~ 72 s
===try3 add the master as a partial worker
/usr/local/spark/sbin/start-slave.sh -h hadoop-master -c 1 spark://hadoop-master:7077
total: 57 s
===try4 4 workers (1 core each), 215*128 MB = 26.875 GB
total: 2.2 min = 132 seconds
*********************************
* *
* Try Seqfile Read performance *
* sequence read *
* 2020-1-18 *
*********************************
      ReadType   dura(sec)   spd(MB/s)
SSD   Key,Val     16.721      248.13
      Only-Key    14.527      285.61
      Key,Val     15.369      269.96
      Only-Key    14.484      286.45
      Key,Val     15.202      272.92
      Only-Key    14.567      284.82
      Key,Val     15.663      264.89
      Only-Key    14.741      281.46
DISK  Key,Val     76.547       54.20
      Only-Key    71.993       57.63
      Key,Val     77.488       53.54
      Only-Key    73.136       56.73
      Key,Val     77.317       53.66
      Only-Key    73.527       56.43
Read a 4148.2 MB seqfile in total. Block size 128 MB, each record 14.4 MB.
Record count: 288.
In VMware, the SSD read speed is about 270 MB/s and the disk read speed about 56 MB/s.
Reading only keys takes 1-2 seconds less on SSD and 4-5 seconds less on disk.
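A small sketch of the two read modes measured above ("Key,Val" vs "Only-Key"), assuming a plain SequenceFile on HDFS; the class name and file path are placeholders of mine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileReadModes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("hdfs://hadoop-master:9000/demo.seq");   // placeholder path

        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

            boolean onlyKeys = true;   // flip to false for the "Key,Val" measurement
            long records = 0;
            while (reader.next(key)) {                      // reads the key; the value is skipped unless asked for
                if (!onlyKeys) reader.getCurrentValue(val); // explicitly deserialize the value
                records++;
            }
            System.out.println("records read: " + records);
        }
    }
}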
****************************
*
* Try Seqfile Read performance
* random read with record interval x.
* 2020-1-18
****************************
               ReadType   dura(sec)   spd(MB/s)
DISK inter2    KV          49          42.33
               K           44          47.14
               KV          46.5        44.60
               K           44.1        47.03
DISK inter4    KV          25.3        40.99
               K           21.8        47.57
               KV          24.3        42.68
               K           22.3        46.50
DISK inter8    KV           1.04      498.58
               K            0.76      682.27
               KV           1.36      381.27
               K            0.79      656.36
DISK inter16   KV           0.66      392.82
               K            0.65      398.87
               KV           0.71      365.16
               K            0.46      563.61
DISK inter     KV           0.16      180.04
               K            0.08      360.09