Upgrade Type and Resource Requirements

| Upgrade Type | EXAScaler 3.x to 4.x non-embedded |
| Plan template | Exascaler-3.x_to_4.x-Non-Embedded.md |
| Case | 108462 |
| Customer | University of Georgia |
| Location | GA |
| Timezone | EDT |
| Customer contact | Saravanaraj Ayyampalayam / (706) 542-0188 / [email protected] |
| Entitlement | BSOS |
| Upgrade planner | Khoa Pham |
| Upgrade team involved | Khoa Pham |
| Date For Upgrade | 2018 |
| Upgrade Time Required | 10.5 hours |
| Support Required | - |
| Remote or On-site Support | - |

Plan Assumptions

Exascaler

  • This upgrade plan requires an outage
  • System must be completely healthy before upgrade
  • The EXAScaler upgrade needs the direct involvement of DDN support during upgrade execution
  • These instructions only work upgrading from EXAScaler 3.x to 4.x
    • Your current EXAScaler version must be at 3.0 or above
  • The DDN upgrade engineer verifies Lustre source files are correct for any manual Lustre upgrades
  • Lustre clients may need to be upgraded

SFA

  • It is recommended to rebuild any storage pools in a no-redundancy state before proceeding with the upgrade
  • During upgrades there is a chance of a drive failure, or drives may go missing for a period of time, requiring pool rebuilds
  • If verifies were disabled for 3 or more months, 2 complete verify cycles must be completed before attempting an upgrade
  • There are no embedded VMs
  • SFA public SSH keys will be changed during/after the upgrade. Please ensure the existing password is known.

Summary of work / Order / Timing

| Time | Description |
| --- | --- |
| 0.5 hour | Before EXAScaler upgrade work |
| 2.0 hour | Prep for EXAScaler Upgrade |
| 0.5 hour | EXAScaler Health Check |
| 4.0 hour | Upgrade EXAScaler Servers (0.5 hr per server) |
| 0.5 hour | Updating exascaler.conf |
| 0.5 hour | Multipath, SRP configuration & IPoIB datagram mode |
| 0.5 hour | ES Install steps |
| 0.5 hour | HA Configuration |
| 0.5 hour | HA startup & EXAScaler upgrade post work |
| 0.5 hour | Optional: Lustre rpm upgrade |
| ? hour | Optional: EF enclosure firmware upgrade & enable vdisk scrub |
| ? hour | Optional: Upgrade Mellanox/OmniPath Firmware, IPoIB configuration (0.25 hr per node) |
| 0.5 hour | Upgrade SFA post work (enable cache, enable bg verify, set sparing back to original setting, health checks) |
| Total upgrade time | 10.5 hours |

System Details

SFA

| Subsystem Name | osctl |
| Controller Type | SFA14KX |
| Uptime for Controllers | Controller 0: 72 Days 21 Hours 56 Minutes; Controller 1: 72 Days 21 Hours 56 Minutes |
| Enclosure Type | 6 - SS14K (head + 5 - SS8462/12) with 0 missing enclosures |
| Number of Mux Drives | 0 |
| Internal Disk Mirroring | mirrored |
| BBU Mfg Date | ups 0 - Fri Aug 30 18:04:57 2013; ups 1 - NOT AVAILABLE |
| Background Verify | Enabled |
| Write Caching | Enabled |

Current EXAScaler

| EXAScaler Version | 3.3.0-r4 |
| Lustre version | 2.7.21.3 |
| Lustre build | 256.ddn20.g10dd357 |

Current Infiniband / OmniPath

| EXAScaler Server OFED/OPA version | MOFED 4.3.1.0.1 |
| Server HCA Firmware | Server 12.18.1000 |

Current Ethernet

| EXAScaler server MTU size | - |
| EXAScaler server Ethernet bonding? | No |
| Tagged VLAN? | N/A |

Upgrade Component List and Links

EXAScaler Upgrade Items Summary

| Description | Current Level | New Level | Comments | Internal URL |
| --- | --- | --- | --- | --- |
| EXAScaler Release | - | 4.0.0 | ISO Image (3 GB) | ES 4.0.0 ISO |
| Lustre Version | - | 2.10.4-ddn4 | - | upgrade ZIP file |

| Lustre Client Version:| - |

Mellanox HCA firmware updater tool: mlxup v4.9.0

mlxup tool is located in /root/EXAScaler-4.0.0/

Alternatively download from mellanox.com

Detailed Process

SFA pre-upgrade work

  1. Disable verifies
   $ set subsystem verify_policy false
  2. Disable caching
   $ set pool * write_back_caching false
  3. View existing jobs
   $ show job

NOTE: Wait for all remaining jobs to complete before continuing

  4. Document the current "sparing policy" and "disk time-out" values
   $ show pool *
  5. Set sparing policy to manual for all pools
   $ set pool * sparing MANUAL
  6. Set drive time-out to 4 hours (240 minutes)
   $ set pool * disk_timeout 240
  7. Capture diags from each controller

    Password=user

   # ssh diag@IP_ADDRESS_OF_CONTROLLER_0 tgz > SR[#####]_[customer-name]_C0_Diag.tgz
   # ssh diag@IP_ADDRESS_OF_CONTROLLER_1 tgz > SR[#####]_[customer-name]_C1_Diag.tgz

SFA External Block Pre-upgrade

  • Document server IPMI credentials and network configuration
  • Download all required files from the supplied links (eg. ISO image file, HCA firmware)
  • Ensure that at least 3 GB of free space is available in the server root partition (see the check below)
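
A quick way to confirm the free-space requirement across all servers (a minimal sketch using clush, which this plan already relies on; adjust the node selection as needed):

   # clush -b -a "df -h /"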

EXAScaler pre-upgrade work

  • Review EXAscaler release notes for limitations or known issues for upgrades

  • Review Upgrade section of EXAscaler Administration Guide

  • Ensure "Open Issues/Actions" items as completed before upgrade complete

  • Ensure the system is healthy before starting upgrade

  • Back up all EXAScaler configuration files for each HA group (a minimal backup sketch follows this list)

    eg) /etc/multipath.conf, /etc/corosync/corosync.conf, /etc/corosync/authkey, /etc/ddn/exascaler.conf, /etc/ddn/device_alias.conf, /etc/ddn/ddn-ibsrp.conf, /etc/srp_daemon.conf, /etc/ddn/srp_daemon.conf, /etc/ddn/tune_devices.conf, /etc/clustershell/, /etc/syslog-ng/, /scratch/log/

  • Backup OS configuration files

    eg) /etc/sysconfig/network-scripts/ifcfg-*, /etc/hosts, /etc/resolv.conf, /etc/sysconfig/network, /etc/ntp.conf, /etc/passwd, /etc/group, /etc/modprobe.conf, /etc/modprobe.d/, /etc/sysctl.conf, /etc/sysctl.d/, /root/.ssh/, /etc/ssh/, nsswitch, ldap/sssd, etc.

  • Disable all external yum repos (ie. the files in /etc/yum.repos.d/) and any custom parameters in the yum configuration file (ie. /etc/yum.conf)
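
A minimal backup sketch for the configuration files listed above (the archive name, the /scratch destination, and the exact file list are assumptions; adjust to site conventions and run on every server, e.g. via clush; files that do not exist on a given node are reported and skipped):

   # mkdir -p /scratch/preupgrade-backup
   # tar -czvf /scratch/preupgrade-backup/$(hostname)-config-$(date +%F).tgz \
       /etc/multipath.conf /etc/corosync /etc/ddn /etc/clustershell \
       /etc/sysconfig/network-scripts /etc/modprobe.d /etc/sysctl.d \
       /root/.ssh /etc/ssh /etc/hosts /etc/resolv.conf /etc/ntp.conf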

EXAScaler Health Check

Note: An outage is required for this upgrade. This is not an online procedure

Client node Health checks

NOTE: Any discovered issues should be resolved before proceeding with the upgrade. Contact DDN support for assistance if necessary.

  • From a client node, check lustre is mounted:
   # mount -t lustre
     10.10.160.31@tcp1:10.10.160.32@tcp1:/pfs on /lustre/pfs/client type lustre (rw)
   # findmnt -t lustre
  • Check client can access lustre servers

    Example output

   # lfs check servers
   pfs-MDT0000-mdc-ffff88003d223400: active
   pfs-OST0000-osc-ffff88003d223400: active
   pfs-OST0001-osc-ffff88003d223400: active
   pfs-OST0002-osc-ffff88003d223400: active
   pfs-OST0003-osc-ffff88003d223400: active
   pfs-OST0004-osc-ffff88003d223400: active
  • check client can access all targets

    Example output

   # lctl dl
   0 UP mgc MGC10.10.160.31@tcp1 9cba63c0-f7ce-c14e-1d95-3312a141b982 5
   1 UP lov pfs-clilov-ffff88003d223400 581969a1-7a8a-d371-3683-2fe3c44e4677 4
   2 UP lmv pfs-clilmv-ffff88003d223400 581969a1-7a8a-d371-3683-2fe3c44e4677 4
   3 UP mdc pfs-MDT0000-mdc-ffff88003d223400 581969a1-7a8a-d371-3683-2fe3c44e4677 5
   4 UP osc pfs-OST0000-osc-ffff88003d223400 581969a1-7a8a-d371-3683-2fe3c44e4677 5
   5 UP osc pfs-OST0001-osc-ffff88003d223400 581969a1-7a8a-d371-3683-2fe3c44e4677 5
   6 UP osc pfs-OST0002-osc-ffff88003d223400 581969a1-7a8a-d371-3683-2fe3c44e4677 5
   7 UP osc pfs-OST0003-osc-ffff88003d223400 581969a1-7a8a-d371-3683-2fe3c44e4677 5
   8 UP osc pfs-OST0004-osc-ffff88003d223400 581969a1-7a8a-d371-3683-2fe3c44e4677 5
  • df listing all targets:

    Example good output

   # lfs df -h
   UUID                       bytes        Used   Available Use% Mounted on
   pfs-MDT0000_UUID           10.8G        4.0G        6.0G  40% /lustre/pfs/client[MDT:0]
   pfs-OST0000_UUID         1021.9M       52.9M      958.8M   5% /lustre/pfs/client[OST:0]
   pfs-OST0001_UUID         1021.9M       52.9M      958.8M   5% /lustre/pfs/client[OST:1]
   pfs-OST0002_UUID         1021.9M       52.9M      958.8M   5% /lustre/pfs/client[OST:2]
   pfs-OST0003_UUID         1021.9M       52.9M      958.8M   5% /lustre/pfs/client[OST:3]
   pfs-OST0004_UUID         1021.9M       52.9M      958.8M   5% /lustre/pfs/client[OST:4]
   
   filesystem summary:        12.0G      634.6M       11.2G   5% /lustre/pfs/client
Example bad output
   # lfs df -h
   UUID                       bytes        Used   Available Use% Mounted on
   pfs-MDT0000_UUID           10.8G        4.0G        6.0G  40% /lustre/pfs/client[MDT:0]
   pfs-OST0000_UUID         1021.9M       52.9M      958.8M   5% /lustre/pfs/client[OST:0]
   pfs-OST0001_UUID         : Resource temporarily unavailable 
   pfs-OST0002_UUID         1021.9M       52.9M      958.8M   5% /lustre/pfs/client[OST:2]

Server Health Checks

  • Check EXAscaler servers are online:

    From one EXAScaler server:

    # esctl pingall
    # corosync-quorumtool
    # hastatus
    # clush -b -a "uptime"
  • Ensure all servers are currently at same software levels
   # clush -b -a "cat /etc/es_install_version"
   # clush -b -a "cat /proc/fs/lustre/version"
  • Verify all lustre devices are mounted on the servers
   # clush -a "mount -t lustre"

   ----------------
   mds0
   ----------------
   /dev/mapper/vg_pfs-mgs on /lustre/mgs type lustre (rw)
   /dev/mapper/vg_pfs-mdt on /lustre/pfs/mdt type lustre (rw)
   ----------------
   oss0
   ----------------
   /dev/mapper/ost_pfs_0 on /lustre/pfs/ost_0 type lustre (rw)
   /dev/mapper/ost_pfs_1 on /lustre/pfs/ost_1 type lustre (rw)
  • Verify lustre healthy on servers

     NOTE: All servers should report 'healthy'. There is a problem if any other value is reported.

   # clush -b -a "cat /proc/fs/lustre/health_check"
  • Pull es_showall logs before making any changes
   # esctl showall -s <DDN_CASE_NO>

Prep for ExaScaler Upgrade

Unmount all compute clients and unload lustre modules

For each client execute the following set of commands

  • Find lustre mount point
   # mount -t lustre
     10.10.160.31@tcp1:10.10.160.32@tcp1:/pfs on /lustre/pfs/client type lustre (rw)
   # findmnt -t lustre
  • Unmount lustre
   # umount -a -t lustre
   # modprobe -rv lustre osc mgc
  • Stop lustre network
   # service lnet stop
  • Remove Lustre kernel modules
   # lustre_rmmod
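
A quick sanity check that the client is clean before the server work begins (standard commands; both should return no Lustre entries):

   # mount -t lustre
   # lsmod | egrep "lustre|lnet"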

Server Halt Services

From EXAScaler server

  • Stop all Server cluster resources
   # cluster_resources --action stop
  • Check status of HA shutdown for each HA group

    Log into any node in each HA group

   # clush -g ha_heads "hastatus"

NOTE: If HA '(corosync|pacemaker)' is not running, the following error will be generated:

Example bad output

   Could not establish cib_ro connection: Connection refused (111)
   Connection to cluster failed: Transport endpoint is not connected
   Could not establish cib_ro connection: Connection refused (111)
   Connection to cluster failed: Transport endpoint is not connected
  • Verify that all Lustre devices are unmounted
   # clush -a "mount -t lustre"
  • Shutdown HA resources

     NOTE: Perform this action for each HA group.

     Start with the HA groups containing OST resources and finish with the HA group containing MDT resources.

   # clush -g ha_heads "crm configure property stop-all-resources=true"

Unmount remaining mounted Lustre devices (can take a few minutes)

  • Log into any MDS node
   # clush -a "es_mount --umount"
  • Verify that all Lustre devices are unmounted
   # clush -b -a "mount -t lustre" 
  • Unload lustre modules
   # clush -a "service lnet stop"
   # clush -a lustre_rmmod
  • Stop the HA stack services (Pacemaker and Corosync) on all nodes
   # clush -a service pacemaker stop
   # clush -a service corosync stop

Upgrade ExaScaler Servers

  • Copy the EXAScaler iso file to all the nodes and mount as a loop device

    From EXAScaler server

   # scp -p <ES_UPGRADE.ISO> root@<ES_Server>:/scratch/
   # sync-file /scratch/<iso_image_filename>
   # clush -a "mkdir -p /mnt/iso"
   # clush -a "mount -v -t iso9660 -o loop,ro /scratch/<ES_UPGRADE.ISO> /mnt/iso"
  • Run the upgrade script in dry-run mode to check for errors
   # clush -a "python /mnt/iso/files/es_upgrade.py -v --dry-run"

Check each dry run for errors

  • Run the upgrade script on each host to apply the upgrade

    NOTE: You can run multiple upgrades in parallel

   # clush -a "python /mnt/iso/files/es_upgrade.py -y -v"
  • Copy files from ISO image to EXAScaler server
   # mkdir -v /root/EXAScaler-4.0.0/
   # cp -vp /mnt/iso/files/* /root/EXAScaler-4.0.0/
   # cp -vp /mnt/iso/Packages/kernel-abi-whitelists-*.noarch.rpm /root/EXAScaler-4.0.0/
   # cp -vp /mnt/iso/Packages/kernel-devel-*.x86_64.rpm /root/EXAScaler-4.0.0/
   # cp -vp /mnt/iso/Packages/kernel-lustre-devel-*.x86_64.rpm /root/EXAScaler-4.0.0/
   # cp -vp /mnt/iso/Packages/kernel-lustre-debug-devel-*.x86_64.rpm /root/EXAScaler-4.0.0/

   # sync-file /root/EXAScaler-4.0.0/
  • Unmount ISO image
   # clush -a "umount -v /mnt/iso"
  • Reboot all EXAScaler nodes
   # clush -a reboot
  • Verify that new version is installed on all server nodes
   # clush -b -a "cat /etc/es_install_version" 

Update exascaler.conf

Make a backup copy of EXAScaler configuration file */etc/ddn/exascaler.conf* before modifying

NOTE: Examples of valid configuration can be found in the folder /opt/ddn/es/examples/
  1. SFA zoning

For each SFA storage system used by the cluster, create an SFA zone section named [sfa <sfa_name>] and add the following options to it:

  • controllers – IP addresses of both SFA controllers
  • user – SFA access account. The value is "user" by default.
  • password – password for the SFA access account. The value is "user" by default.

Example

   [sfa sfa12k0]
   controllers: 10.52.16.44 10.52.16.45
   user: user
   password: user

  • In the [global] section, replace the controller IP addresses in the sfa_list parameter with the SFA zone names

    Example

   [global]
   sfa_list: sfa12k0

  • In each [host <hostname>] section append host_sfa_list entry with appropriate SFA zone

    Example

   [host mds0]
   host_sfa_list: sfa12k0

  2. NTP

In the [global] section, add IP address or hostname of NTP server to entry ntp_list

Example

   [global]
   ntp_list: 10.0.0.5

  3. Corosync rings

In the [HA] section of /etc/ddn/exascaler.conf, replace network interface names in parameter corosync_nics with corosync rings.

NOTE: only ring0 and ring1 are allowed.

Example

   [HA]
   corosync_nics: ring0 ring1

  • For each Corosync ring, specify the network interface

    Either in the [host defaults] section or specific hosts section [host <node>].

    Example

   [host oss-1-srv]
   ring0: ens3
   ring1: ib0

  4. nic_list & lnets

Check network interfaces are listed in nic_list parameters in the [host defaults] section or specific hosts named like [host <node>].

NOTE: entries in [host ] section override [host_defaults] parameters

Example

   [host oss-1-srv]
   nic_list: ens3 ib0

Check entries for lnets parameters in the [host defaults] section or specific hosts named like [host <node>]

Example

   [host oss-1-srv]
   lnets: tcp0(ens3) o2ib0(ib0)

  5. backend FS

Set backend FS type to ldiskfs. See file /opt/ddn/es/examples/exascaler.conf for details.

In each [fs ] section append backfs entry:

   [fs testfs]
   backfs: ldiskfs

  6. lustre mount options

Prevent the Lustre mount command from changing DDN block device tuning parameters; the option max_sectors_kb=0 prevents Lustre from changing the current value.

In each [fs ] section append mount_opts entries:

   [fs testfs]
   mdt_mount_opts: max_sectors_kb=0
   ost_mount_opts: max_sectors_kb=0
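
To record the block-device values this option is meant to preserve, the current setting can be read directly from sysfs (a hedged sketch; the dm-* names are placeholders for the multipath devices backing the Lustre targets):

   # clush -b -a "cat /sys/block/dm-*/queue/max_sectors_kb"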

  7. lustre o2iblnd options

Add or modify modprobe settings for lustre LND driver o2iblnd. Option "conns_per_peer" sets number of QPs for each lustre rdma connection. See modinfo ko2iblnd for details.

Defaults: conns_per_peer=1 for Mellanox and conns_per_peer=4 for OmniPath. For backward compatibility we set conns_per_peer=1.

NOTE: entries in [host ] section override [host_defaults] parameters

Example

   [host_defaults]
   modprobe_cfg:
     options ko2iblnd conns_per_peer=1
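
Once the modprobe configuration has been applied and the o2iblnd module is loaded, the effective value can be confirmed (a minimal sketch; the second command only returns a value when the module is loaded):

   # modinfo ko2iblnd | grep conns_per_peer
   # clush -b -a "cat /sys/module/ko2iblnd/parameters/conns_per_peer"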

External Block SRP configuration

NOTE: Further information in "Known Issues" section of EXAScaler Release Notes

  • confirm ib_srp devices are visible
   # lsscsi -lH | grep -A2 ib_srp

External Block Multipath configuration

NOTE: Multipath alias name format described in appendix of Exascaler Install & Admin Guide

  • confirm ddn-ibsrp service is running (ie. opensm + srpd)
   # service ddn-ibsrp start
  • confirm SFA/EF multipath devices are visible
   # lsscsi -l
  • backup existing multipath.conf
   # cp -vp /etc/multipath.conf /etc/multipath.conf.ESpre-upgrade
  • Regenerate SFA device wwids and alias names with the "create-table" command for each SFA storage system:

    NOTE: the wwids of connected block devices are listed in /dev/disk/by-id/

  • From server connected to SFA storage:

   # create-table --new-style --type multipath --sfa <sfa_ip_address> >> /etc/multipath-aliases.sfa
  • check multipath aliases match new EXAScaler naming format

    Example

    • old name format: alias ost_scratchfs_1
    • new name format: alias scratchfs_ost0001
  • Append multipath aliases to multipath.conf

   # cp -v /etc/multipath.conf.ddn /etc/multipath.conf
   # cat /etc/multipath-aliases.sfa >> /etc/multipath.conf
  • For EF enclosures, restore the EF multipath aliases from the backed-up multipath.conf file

    NOTE: EF enclosures are usually connected to the MDS servers. EF multipath devices may not be visible from OSS server nodes.

  • restart multipath service

   # multipath -F
   # udevadm trigger
   # systemctl enable multipathd 
   # systemctl restart multipathd
   # multipath -l
  • check device aliases listed in /dev/mapper/

  • copy file multipath.conf to other server nodes

    NOTE: ensure not to overwrite multipath.conf containing EF aliases

  • flush & restart multipath service on other server nodes
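
A hedged sketch of distributing the file and restarting the service with clush, run from the node where the updated multipath.conf was created (the mds0,mds1 exclusion is a placeholder for the MDS nodes whose multipath.conf already carries EF aliases, as noted above):

   # clush -a -x mds0,mds1 --copy /etc/multipath.conf --dest /etc/multipath.conf
   # clush -a -x mds0,mds1 "multipath -F; systemctl restart multipathd; multipath -l"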

Infiniband IPoIB Configuration

Mellanox ConnectX-4 Infiniband adapters enable IPoIB Connected Mode by default. We have found scaling issues with Connected Mode (CM) on larger Infiniband networks (ie. >50 clients). See the IP over Infiniband kernel documentation.

We recommend disabling CM on EXAScaler servers and clients.

Set IPoIB datagram mode

Configuration file /etc/infiniband/openib.conf is used by Mellanox OFED.

  • Set configuration parameter SET_IPOIB_CM=no

    Example

   cp -vp /etc/infiniband/openib.conf /etc/infiniband/openib.conf.ESpre-upgrade
   sed --in-place -e 's/SET_IPOIB_CM=.*/SET_IPOIB_CM=no/' /etc/infiniband/openib.conf
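
After the driver restart or reboot, the effective IPoIB mode can be checked per interface (ib0 is an assumption; the expected output is "datagram"):

   # clush -b -a "cat /sys/class/net/ib0/mode"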

ES Install steps

  • Run the about_this_host script on each node and validate that no issues are reported

   # about_this_host --check

Example

   # about_this_host --check
   Host name: es51
   EXAScaler version: 3.2.0
   EXAScaler flavour: HPC
   Peers of this host are: es52
   Network interface: eth0 (192.168.3.51)
   Network interface: eth1 (172.16.2.51)
   Lustre is exported on interfaces: tcp(eth0)
   Stonith type: none
   Details for filesystem testfs30
   mdt runs on this host: /dev/vg_mdt0000_testfs30/mdt0000 /lustre/testfs30/mdt0000
   /dev/mapper/testfs30_mdt0000_s0 (prio: 1)
   mgs runs on this host: /dev/vg_mgs/mgs /lustre/mgs
   /dev/mapper/mgs (prio: 1)
   ost 0 runs on this host: /dev/mapper/testfs30_ost0000 /lustre/testfs30/ost0000
   ost 1 can failover to this host: /dev/mapper/testfs30_ost0001 /lustre/testfs30/ost0001
   FSCK logs are saved to /scratch/log/testfs30
  • Configure exascaler with the following command:

     Details about the es_install steps are documented in the EXAScaler Install & Administration Guide. Only a few es_install steps are used during upgrades.

     NOTE: The "es_install --steps" recipe may change case-by-case

     ie) depending on whether the customer is OK with overwriting network config files, modprobe files, IPMI, etc.

     NOTE: "es_install --steps lvm" will overwrite /etc/lvm/lvm.conf. On some servers (eg. HP Proliant) the LVM config changes can cause the server to hang on next reboot.

     NOTE: "es_install --steps ha" might complain that the file /etc/corosync/corosync.conf already exists

     IMPORTANT: skip steps: nics, ipmi, restart_network, lvm, & lustre

  • On each Exascaler node run the es_install tool

    Example output:

   # es_install --debug --yes --steps os,kdump,ntp,modprobe,logging,mount_points
   Operation system configuration ...
   Operation system was successfully configured.
   Kdump configuration ...
   Kdump was successfully configured.
   Modprobe configuration ...
   Modprobe was successfully configured.
   Logging configuration ...
   Logging was successfully configured.
   Mount points configuration ...
   Mount points were successfully configured.
  • Reboot server and check configuration
   # about_this_host --check

Post upgrade mount and checks

Test and mount lustre targets on each node individually

  • Verify that Lustre volumes can be mounted manually on each server node.
   # es_mount --dry-run
  • Mount Lustre targets

    Lustre targets are mounted in the following order: 1) MGS volume 2) MDT volumes 3) OST volumes

   # es_mount
  • Check mounts
   # clush -a "mount -t lustre"
   # clush -a "lustre_recovery_status.sh"   
  • Verify that Lustre is healthy on all server nodes
   # clush -a "cat /proc/fs/lustre/health_check"	
  • test mounting lustre on a client
   # mount_lustre_client --dry-run
  • validate client lustre access
   # lfs check servers
   # lfs df -h
   # ls -l /path/to/lustre/
  • after validation umount lustre client and lustre targets

    lustre targets are umounted in following order: 1) lustre clients 2) OST volumes 3) MDT volumes 4) MGS volume

   # mount_lustre_client --umount
   # es_mount --umount
   # clush -a "mount -t lustre"

Additional Upgrade Tasks

Complete Optional Upgrade tasks

HA Configuration

NOTE: If multiple HA groups are being used, use "clush -g ha_heads" to run the command on the head node of each HA group

NOTE: HA group entries are stored in /etc/clustershell/groups.d/local.cfg
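
For reference, the group entries in local.cfg typically look like the following (host names here are placeholders, not this system's actual nodes):

   # cat /etc/clustershell/groups.d/local.cfg
   all: mds[0-1],oss[0-3]
   ha_heads: mds0,oss0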

  • Erase current HA config

    This step will stop pacemaker & corosync services on all EXAScaler nodes

   # clush -a "service pacemaker start"
   # esctl cluster --action destroy
  • Re-generate corosync configuration
   # clush -a "config-corosync --regen-config"
  • Start corosync service
   # clush -a "systemctl enable corosync"
   # clush -a "systemctl start corosync"
  • validate corosync service running
   # clush -g ha_heads "corosync-quorumtool"

pacemaker HA configuration

NOTE: corosync service must be running

NOTE: umount lustre targets from EXAScaler nodes before starting

  • Start pacemaker service
   # clush -a "systemctl enable pacemaker"
   # clush -a "systemctl start pacemaker"
  • validate pacemaker service
   # clush -g ha_heads "hastatus"

NOTE: If HA '(corosync|pacemaker)' is not running, the following error will be generated:

Example bad output

   Could not establish cib_ro connection: Connection refused (111)
   Connection to cluster failed: Transport endpoint is not connected
   Could not establish cib_ro connection: Connection refused (111)
   Connection to cluster failed: Transport endpoint is not connected
  • Re-generate HA configuration

    Run commands for each HA group

   # clush -g ha_heads "config-pacemaker --dry-run > /root/config_pacemaker.`date +%F`"
   # clush -g ha_heads "config-pacemaker"
  • validate HA configuration
   # clush -g ha_heads "hastatus"

HA startup

  • Start all HA resources & lustre targets:
   # cluster_resources --action start

NOTE: this command will umount & re-mount lustre targets

  • check HA startup
   # clush -g ha_heads "hastatus"
  • validate lustre targets mounted
   # clush -a "mount -t lustre"
   # clush -a "lustre_recovery_status.sh"   
  • Verify Lustre health on all server nodes
   # clush -b -a "cat /proc/fs/lustre/health_check"	

EXAScaler post work

NOTE: Ask customer to run client side checks

  • capture showall logs
   # esctl showall -s <ddn_support_case_number>
  • upload ddn_showall file to ftp.ddntsr.com
   # curl --ftp-ssl-ccc --keepalive-time 10 -T ./<path_to_ddn_showall_file> ftp://ftp.ddntsr.com/upload/ --user anonymous:<my_email_address>

SFA post work

Enable cache, enable bg verify, set sparing back to original setting, health checks

  1. Enable caching
   $ set pool * write_back_caching true
  2. Resume any jobs that may have been previously paused
   $ show job
   $ resume job X #where X is the job id
  3. Enable background verify
   $ set subsystem verify_policy true
  4. Set the sparing policy back to its pre-upgrade value for all pools
   $ set pool * sparing <AUTO/SWAP/etc>
  5. Set the drive time-out back to its pre-upgrade value for all pools
   $ set pool * disk_timeout <PREVIOUS VALUE>
  6. Capture diags from each controller

    Password=user

   # ssh diag@IP_ADDRESS_OF_CONTROLLER_0 tgz > SR[#####]_[customer-name]_C0_Diag.tgz
   # ssh diag@IP_ADDRESS_OF_CONTROLLER_1 tgz > SR[#####]_[customer-name]_C1_Diag.tgz
  7. If required, initialize the Battery Life Remaining feature

     a. Log in to the controller

     b. Issue the command: show ups * all

     c. If a battery manufacturing date is displayed, do nothing more. However, if the battery manufacturing date and life remaining are not available as shown below, proceed to step d.

   Battery Mfg. Date: NOT AVAILABLE
   Battery Life Remaining: NOT AVAILABLE

d. Issue the command `SET UPS <encl-idx> <UPS-idx> BATTERY_MANUFACTURE_DATE=<yyyy>:<mm>:<dd>`

   Choose a date close to when the system was installed. 

e. Verify it was set - issue the command `show ups * all` and you should see output similar to the following:

   Battery Mfg. Date: Thu Sep 8 4:10:30 2012
   Battery Life Remaining: 730 days

  8. If required, set Internal Disk Mirroring if the A or B drives are NOT in MIR state.
   $  show internal_disk
   $  assign INTERNAL_DISK <enclosure-id> <object-id> to_system_disk

Appendix

IME product support matrix

From IME release notes:

  • IME 1.0.0: EXAScaler 2.1.2
  • IME 1.1.1: EXAScaler 2.4.0-r10 (tested with IB FDR)
  • IME 1.1.2: Updated EXAScaler 2.4.0-r10 to EXAScaler 3.2.0

Insight product support matrix

  • Insight 1.0.0: EXAScaler 3.2.0
  • Insight 1.0.1: EXAScaler 3.2.0 or higher

TODO/WIP

Additional instructions

  1. LVM volgroup activation

    NOTE: this step assumes the LVM volgroups are the same on all EXAScaler nodes

    NOTE: no wildcards nor pattern matching is allowed

    • list LVM volume groups used by the OS; volgroups are usually VolGroup00 and sometimes VolGroup01

    Example:

   # vgs
   VG                  #PV #LV #SN Attr   VSize   VFree  
   VolGroup00            1   4   0 wz--n-  23.47g      0 
   vg_mdt0000_testfs20   1   1   0 wz--n- 512.00g 508.00m
   vg_mgs                1   1   0 wz--n- 508.00m 252.00m
  • In the exascaler.conf [global] section, add the OS LVM volgroup names to the entry vg_activation_list

Example:

   [global]
   vg_activation_list: VolGroup00

NOTE: do NOT add any volumes managed by HA to vg_activation_list

This parameter controls entry auto_activation_volume_list in /etc/lvm/lvm.conf

The "auto_activation_volume_list" is exclusive. When enabling this LVM parameter, if a volgroup is NOT listed, it won't be activated on boot.

Hint: make sure you list the volgroup containing the root (/) filesystem.
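
For reference, es_install translates this into an entry of roughly the following form in /etc/lvm/lvm.conf (volgroup name taken from the example above; treat this as an illustration, not the exact generated file):

   activation {
       auto_activation_volume_list = [ "VolGroup00" ]
   }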

On each Exascaler node run the es_install tool

Example:
   # es_install --debug --yes --steps lvm
  2. email relay

    In exascaler.conf [global] section add email information

   # site email domainname
   email_domain: example.com
   # external smtp server
   email_relay: mail.example.com
   # List of email addresses to receive logging notification (comma delimited)
   email_list: Test User [email protected]

On each Exascaler node run the es_install tool

Example:
   # es_install --debug --yes --steps email
  3. Project Quotas

    Support project quotas "space accounting" in ldiskfs backend

    edit exascaler.conf

   [conf_param_tunings]
   # List of Lustre parameters which can be set using lctl conf_param command
   # enable quotas user/group/project
   testfs.quota.mdt: ugp
   testfs.quota.ost: ugp

Add "-O project" to mkfs command.

edit exascaler.conf, in each [fs ] section append mke2fs_opts entries:

   [fs testfs]
   # mke2fs options for the OSTs
   ost_mke2fs_opts: -m1 -i 131072 -O project

On each Exascaler node run es_tunefs & es_tune_lustre tools

Example:
   # es_tunefs --dry-run
   # es_tune_lustre --dry-run
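
Once project quotas are enabled, they can be exercised from a Lustre 2.10+ client; a hedged example (the project ID, directory, and mount point below are placeholders):

   $ lfs project -p 1000 -s /lustre/testfs/client/projdir
   $ lfs quota -p 1000 /lustre/testfs/client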
  4. Enable Lustre jobstats

    Enable Lustre jobstats feature

    edit exascaler.conf

    Replace <fsname> with name of lustre filesystem

   [conf_param_tunings]
   <fsname>.sys.jobid_var: procname_uid

On each Exascaler node run es_tune_lustre tool

Example:
   # es_tune_lustre --dry-run
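
After es_tune_lustre has applied the setting, jobstats can be spot-checked on the servers with standard lctl parameters (run the mdt.* form on an MDS node and the obdfilter.* form on an OSS node):

   # clush -b -a "lctl get_param jobid_var"
   # lctl get_param mdt.*.job_stats
   # lctl get_param obdfilter.*.job_stats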
  5. Lustre Progressive File Layout

    Set filesystem default lustre striping.

    manpage lfs-setstripe:

    "If the default file layout is set on the filesystem root directory, it will be used as the filesystem-wide default layout for all files that do not explicitly specify a layout and do not have a default layout on the parent directory. The default layout set on a directory will be copied to any new subdirectories created within that directory at the time they are created."

    PFL command syntax:

   lfs setstripe <--component-end|-E end1> [STRIPE_OPTIONS] [<--component-end|-E end2> [STRIPE_OPTIONS] ...] <filename>

example lustre striping

Run lfs setstripe command on lustre client with lustre filesystem mounted

    $ lfs setstripe \
    -E 16M --stripe-count=1 --stripe-size=16M \
    -E 64M  --stripe-count=4 --stripe-size=16M \
    -E 256M --stripe-count=8 --stripe-size=32M \
    -E -1 --stripe-count=16 --stripe-size=32M /lustre/testfs/client
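
To confirm the default layout was applied, it can be read back from the same directory (path taken from the example above):

    $ lfs getstripe -d /lustre/testfs/client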

Optional Tasks

[Optional] EF Enclosure firmware upgrade

EF firmware can be upgraded via the WebUI. Alternatively, upgrade via the console/CLI using these instructions. The CLI upgrade requires an external host with an ftp client to upload the flash file.

NOTE: EF firmware updates can take 40 min when PFU is enabled

NOTE: WebUI & CLI default credentials are manage/!manage

NOTE: ftp default credentials are ftp/!ftp

  • Current EF fw versions:

    • EF3015 Firmware TS252P006 (2017 Sept)
    • EF4024 Firmware GL222R050 (2017 Sept)
  • Firmware Links (druva login required)

Notices

  • Do not cycle power or restart devices during a firmware update. If the update is interrupted or there is a power failure, the module could become inoperative. If this occurs, contact DDN support.

  • For single controller/single domain systems, I/O must be halted (ie. outage required) for upgrade.

  • In dual-module enclosures, both controllers or both I/O modules must have same firmware version.

  • Set "Partner Firmware Update option" so that, in dual-controller systems, both controllers are updated. When the Partner Firmware Update option is enabled, after the installation process completes and restarts the first controller, the system automatically installs the firmware and restarts the second controller.

  • For dual controller systems, because the online firmware upgrade is performed while host I/Os are being processed, I/O performance is impacted during the upgrade process.

WebGui Instructions

WebUI Health Check

  • In the Configuration View panel, select the System tab and then select View > Overview

    The System Overview table shows the system's health as one of:

    • OK
    • Degraded
    • Fault
    • Unknown

WebUI firmware update

NOTE: Confirm that all I/O to this storage system has been halted before starting

  1. Restart the Management Controller component within both system controller modules:

    from the System tab in the GUI, select button Action > Restart System

    The Controller Restart and Shut Down panel then opens:

    • Select the Restart operation

    • Select the controller type to restart: Management

    • Select both controller modules (A + B)

      • Click OK. A confirmation panel appears

      • Click Yes to continue, a message will describe the restart activity

  2. Once controllers have been fully restarted, navigate to the GUI System tab and select Action > Update Firmware

  3. Click Browse and select the firmware file to upload

    NOTE: If Controller cannot be updated, the update operation is cancelled. Verify that you specified the correct firmware file and repeat the update.

  4. Click OK, a pop-up panel will appear to show upload progress

    NOTE: Do not perform a power cycle or controller restart during a firmware update. If the update is interrupted or there is a power failure, the module might become inoperative.

  5. When the update is complete, clear the history from your local web browser, then sign into the GUI.

  • When the Partner Firmware Update (PFU) feature is enabled, a panel will display progress and the GUI will prevent other tasks until the update of the partner controller is complete.
  • Allow up to 20 minutes for the PFU cycle to complete.
  6. Once the PFU tasks have completed, confirm system status in the GUI, collect system logs and submit to technical support.

CLI Instructions

CLI Health Check

  • Check DotHill enclosure health via ssh

    default operator credentials manage/!manage

# echo -e "show system" | ssh -T manage@IP_ADDRESS_OF_DOTHILL_ENCLOSURE
  Password: 
  DDN EF3000 EF3015
  System Name: Eng-EF3015
  Version: TS251R004-05
.
  # show system
  System Information
  ------------------
.
  Health: OK
  Health Reason: 

CLI Firmware update

  • Download & extract firmware binary file

    • EF3015 name has format: TSxxxRyyy-zz.bin
    • EF4024 name has format: GLxxxRyyy-zz.bin
  • Determine network-port IP addresses of the system controllers. Login to controller via ssh

 # ssh manage@IP_ADDRESS_OF_EF_ENCLOSURE
  • Determine current FW version (EF3015 firmware name format: TSxxxRyyy-zz)
 # show version
 Controller A Versions
 ---------------------
 Bundle Version: TS251R004-05
. 
 Controller B Versions
 ---------------------
 Bundle Version: TS251R004-05
  • Verify FTP service is enabled
 # show protocols  
 Service and Security Protocols
 ------------------------------
. 
 File Transfer Protocol (FTP): Enabled
  • Verify that user "ftp" has permission to FTP service and has manage access rights.
 # show user ftp
 Username             Roles                            User Type    User Locale           WBI   CLI   FTP   SMI-S SNMP
 Authentication Type Privacy Type Password                         Privacy Password                 Trap Host Address
 ------------------------------------------------------------------------------------------------------------------
 ftp                  manage,monitor                   Standard     English                            x           
                                    ********                         ********                 
  • (Dual controller enclosure) enable partner firmware update:
 # set job-parameters partner-firmware-upgrade enabled
 Info: Parameter 'partner-firmware-upgrade' was set to 'enabled'. (2017-02-17 14:31:28)
.
 Success: Command completed successfully. - The settings were changed successfully. (2017-02-17 14:31:28)
  • from external Linux host, upload firmware to EF controller as filename "flash".

    Log in with user ftp (user = ftp, password = !ftp).

 $ ftp <EF_IP_ADDRESS>
 $ bin
 $ put <TSxxxRyyy-zz.bin> flash 
  • After file upload completes, firmware update will start.

    Wait for the installation to complete. During installation, each updated module automatically restarts.

    Example EF3015 firmware update:

 $ ftp> put ./TS252P005.bin flash
 local: ./TS252P005.bin remote: flash
 227 Entering Passive Mode (192,168,40,231,56,94)
 150 Accepted data connection
 226-File Transfer Complete. Starting Operation:
 Checking component list
 mc bundle component check passed.
 Checking bundle integrity...
 Initial mc file integrity checks passed.
 Checking system health.
 System health check complete. Health state: OK
 Stopping Management Controller applications.
 Starting message server
 Initial connection to SC successful.
...
 STATUS: Updating Storage Controller firmware.
 Controller current bundle: TS251R004-05,loading bundle TS252P005
 Instructing SC to shut down and reboot when finished updating
 Waiting 5 seconds for SC to shutdown.
 Shutdown of SC successful.
 Sending new firmware to SC.
 Waiting for Storage Controller to complete programming.
 Please wait...
 Storage Controller has completed programming.
 Updating SC Image:Remaining size 0
 Waiting for SC reboot.
...
 Storage Controller has rebooted.
 Storage Controller has been updated, proceeding to next step.
 STATUS: Updating Management Controller firmware
...
 Finished updating Management Controller firmware
 Updating system configuration files
 System configuration complete
.
 ==========================================
 Software Component Load Summary:
.
 MC Software: SUCCESSFUL
 SC Software: SUCCESSFUL
 EC Software: NOT ATTEMPTED
 Expansion Software: NOT ATTEMPTED
 ==========================================
.
 Code load completed successfully.
 Restarting Management Controller...
 Rebooting...
  • If Partner Firmware Update is disabled, after updating firmware on one controller, you must manually update the second EF controller.

  • Review EF event logs for firmware update event codes [269] & [237]

 # show events
 2017-02-17 15:59:32 [269] #A1435: EF3015 Array SN#00C0FF142609 Controller A INFORMATIONAL Partner Firmware Update progress: PFU completed on local controller, SUCCESS (info: p1: 5, p2: 0, p3: 0, p4: 0)   
 2017-02-17 15:59:32 [269] #A1434: EF3015 Array SN#00C0FF142609 Controller A INFORMATIONAL Partner Firmware Update progress: PFU send package done, SUCCESS (info: p1: 17, p2: 0, p3: 0, p4: 0)   
 2017-02-17 15:59:32 [237] #B1386: EF3015 Array SN#00C0FF142609 Controller B INFORMATIONAL Firmware update progress: The SC app was updated. Saved in primary location in flash. The firmware is different so it was flashed; flashed successfully.   
 2017-02-17 15:58:45 [237] #B1385: EF3015 Array SN#00C0FF142609 Controller B INFORMATIONAL Firmware update progress: The firmware was verified. (from MC: no, for MC: no)   
 2017-02-17 15:58:34 [269] #A1433: EF3015 Array SN#00C0FF142609 Controller A INFORMATIONAL Partner Firmware Update progress: PFU sending package to partner SC,  (info: p1: 16, p2: 0, p3: 0, p4: 0)   

EF Enclosure Background vdisk Scrub

This step is required to protect your data.

NOTE: required minimum firmware version EF3015 TS252P005 or EF4024 GL222R050

  • Log in to the EF enclosure CLI as administrator (user ID manage, default password !manage) and enter the line command:
 # set job-parameters background-scrub enabled
 Info: Parameter 'background-scrub' was set to 'enabled'. (2017-02-17 14:29:22)
.
 Success: Command completed successfully. - The settings were changed successfully. (2017-02-17 14:29:22)
  • Check the status of the vdisk background scrub with the command:
 # show job-parameters
 Job Parameters
 --------------
 Vdisk Background Scrub: Enabled
  • Review EF enclosure event logs after completion of scrub to determine whether data integrity issues were found.

    Log in to the CLI as administrator, then enter the line command:

 # show events

[Optional] Upgrade HPE Proliant Servers Firmware & Drivers

HPE Proliant servers require updated device drivers to support Redhat 7.4 or newer

Important: EXAScaler 3.3 (based on CentOS 7.4) will not function unless ProLiant firmware & drivers are updated.

Download and install the HPE Service Pack for ProLiant 2018.03.0 Supplement after re-imaging the EXAScaler software.

ProLiant Gen6 & Gen7 Models RHEL7 Unsupported

HPE Proliant Gen6 & Gen7 servers are not supported with EXAScaler 3.x. HPE has not provided RHEL7 compatible device drivers for older generation of SmartArray storage controllers based on cciss device driver (eg. P400i).

From HPE whitepaper:

| HPE Smart Array Controller models no longer supported in Red Hat Enterprise Linux 7 include: P400, P400i, P800, E200, E200i, P700m, 6400, 641, 642, and 6i

HPE Support for Redhat Linux 7

ProLiant Redhat Linux 7 support appears to be defined by processor support: HPE ProLiant Gen10 (Skylake) supports Redhat 7.3 or newer, and HPE ProLiant Gen9 (Broadwell) supports Redhat Linux 7.2 or newer.

ProLiant servers with P22x, P41x or P42x "Smart Array" storage controllers require firmware 8.00 or newer for RHEL7. https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=5295169&swItemId=MTX_42b6aa58956a438aa85bd73d0f&swEnvOid=4184

[Optional] Mellanox HCA firmware update

NOTE: use the mlxup tool shipped with this release (v4.9.0, see the Upgrade Component List above) to apply the HCA firmware version listed in the EXAScaler release notes

  • Determine current HCA firmware version
# clush -a 'ibstat | egrep -i "CA |firmware"'

Compare the above Mellanox version to the release notes for the EXAScaler version

If server HCA firmware is below recommended level, proceed with upgrade

  • Update HCA firmware

    Below we assume the mlxup tool is located in /root/EXAScaler-4.0.0/mlxup; you may need to download the mlxup tool from mellanox.com (see URL above)

# chmod +x /root/EXAScaler-4.0.0/mlxup
# /root/EXAScaler-4.0.0/mlxup --query
# /root/EXAScaler-4.0.0/mlxup --update

After firmware completes, reboot server

  • repeat firmware update steps for other EXAScaler server nodes

Upgrade Mellanox Firmware with flint

NOTE: This Mellanox fw applies to the server HCAs connected to the SFA storage, NOT the HCAs connected to clients

  1. Determine if an upgrade to the Mellanox drivers is necessary.
# clush -a "ibstat"

Compare the above Mellanox driver versions to the release notes for the current SFAOS version. Continue with the upgrade if the versions on the server nodes do not match the recommended HCA firmware version.

  2. Get the firmware from the Mellanox support downloader:

    a. find PSID from ibv_devinfo.txt file in es_showall logs. The board_id is the PSID

# grep board_id ibv_devinfo.txt
  board_id:                       MT_1090120019
  board_id:                       MT_1090120019

b. Download HCA firmware matching PSID from http://www.mellanox.com/supportdownloader/

To download older firmware:

  • under "Select a Family" column select "Adapter Cards"

  • Under "Select a Line" column select the card name found from inputting the PSID

  • Under "Select an OPN" column select the OPN from the firmware file name given for from inputting the PSID

  • Under "Product Support Information" column click "Check for older versions"

    Example:

    fw-ConnectX3-rel-2_36_5000-MCX354A-FCB_A2-A5-FlexBoot-3.4.718.bin.zip

    • the OPN is "MCX354A-FCB"
    • Under "Select a PSID (Rev)" column select the PSID that matches the PSID from ibv_devinfo.txt
    • Under "Product Support Information" column click "Check for older versions"

c. Download the appropriate firmware version and release notes

  3. Flash the MLX adapters:

    a. copy .bin firmware file to server

    b. On first ES Server run mst start on server

   # mst start
c. run mst status to get device names ex: /dev/mst/mt4099_pciconf0
   # mst status
d. use the flint burn command with the -d flag (device path) and the -i flag (firmware file name) to burn the new fw to the card (upgrade the fw)
   # flint -d <DEVICE_NAME> -i <MELLANOX_FIRMWARE.BIN> burn
e. reboot server

f. after reboot, run ibstat to confirm new fw version
   # ibstat
g. run lsscsi -l to ensure access to the SFA devices
   # lsscsi -l

Mellanox Firmware Matrix

[Optional] Upgrade OmniPath HFI firmware

Update the firmware on the Omni-Path Host Fabric Interface (HFI)

NOTE: HFI firmware can be updated on embedded systems

There are three files available for Option ROM EPROM partitions. These default files are packaged with Intel Fabric Suite (IFS) and Basic releases. See the Intel Omni-Path Fabric Software Release Notes for the version provided in the release. The files are:

  • HFI1 UEFI Option ROM: HfiPcieGen3_x.x.x.x.x.efi
  • UEFI UNDI Loader: HfiPcieGen3Loader_x.x.x.x.x.rom
  • HFI1 platform file: hfi1_platform.dat

The HFI UEFI firmware is packaged in RPM hfi1-uefi.x86_64 and bundled with Omni-Path Fabric Software (OFS) "basic" package.

NOTE: The included hfi1_platform.dat file is for Intel HFI adapters. If your HFI adapter is from another manufacturer, you may require different hfi1_platform.dat file. Contact your manufacturer's support team to confirm.

Your HFI may also have Thermal Monitoring Module (TMM) firmware, an optional micro-controller for thermal monitoring on vendor-specific HFI adapters using the SMBus.

To upgrade the HFI firmware, perform the following steps:

  • review fabric software release notes for any special instructions

  • confirm HFI UEFI rpm installed

   rpm -qa | grep -i hfi1-uefi
  • confirm firmware files in /usr/share/opa/bios_images
   # ls -l /usr/share/opa/bios_images
   -rw-r--r-- 1 root root 489168 Oct 4 08:44 HfiPcieGen3_1.6.0.0.0.efi 
   -rw-r--r-- 1 root root 65024 Oct 4 08:44 HfiPcieGen3Loader_1.6.0.0.0.rom 
   -rw-r--r-- 1 root root 19530 Oct 4 08:44 License_UEFI_Option_ROM 
   -rw-r--r-- 1 root root 252298 Oct 4 08:44 License_UEFI_Option_ROM.pdf 
  • confirm platform file in /lib/firmware/updates/
   # ls -l /lib/firmware/updates/hfi1_platform.dat
  • determine the device path to be used in command
   # hfi1_eprom -v 
   Using default device: /sys/bus/pci/devices/0000:04:00.0/resource0 
  • check existing driver version
   # hfi1_control -i
   Driver Version: 0.9-294
   Driver SrcVersion: A08826F35C95E0E8A4D949D
   Opa Version: 10.3.0.0.81
   0: BoardId: Intel Omni-Path Host Fabric Interface Adapter 100 Series
   0: Version: ChipABI 3.0, ChipRev 7.17, SW Compat 3
   0: ChipSerial: 0x00790311
   0,1: Status: 5: LinkUp 4: ACTIVE
   0,1: LID=0x1 GUID=0011:7501:0179:0311
   # hfi1_eprom -V -b 
   Using device: /sys/bus/pci/devices/0000:04:00.0/resource0 
   driver file version: 1.4.2.0.0 
   # hfi1_eprom -V -o 
   Using device: /sys/bus/pci/devices/0000:04:00.0/resource0 
   loader file version: 1.4.2.0.0 
   # hfi1_eprom -V -c
  • run upgrade command as instructed in OmniPath release notes
   # cd /usr/share/opa/bios_images
   # hfi1_eprom -w -o HfiPcieGen3Loader_1.6.0.0.0.rom -b HfiPcieGen3_1.6.0.0.0.efi -c /lib/firmware/updates/hfi1_platform.dat

   Using device: /sys/bus/pci/devices/0000:04:00.0/resource0
   Erasing loader file... done
   Writing loader file... done
   Erasing driver file... done
   Writing driver file... done
  • validate TMM firmware file hfi1_smbus.fw
   # opatmmtool -f /lib/firmware/updates/hfi1_smbus.fw fileversion
  • check current TMM version
   # opatmmtool -fwversion 
   # opahfirev
   rdma-qe-15  - HFI 0000:04:00.0
   HFI:   hfi1_0
   Board: ChipABI 3.0, Board ID 0x1, ChipRev 7.17, SW Compat 3
   SN:    0x00790311
   Location:Discrete Socket:1 PCISlot:00 NUMANode:1 HFI0
   Bus:   Speed 8GT/s, Width x16
   GUID:  0011:7501:0179:0311
   SiRev: B1 (11)
   TMM:   10.0.0.0.696
  • update TMM if required
   # opatmmtool -f /lib/firmware/updates/hfi1_smbus.fw update

   opatmmtool: Opened the driver interface
   File Firmware Version=10.2.1.0.3
   opatmmtool: Firmware length=51468
   opatmmtool: Waiting for device to erase flash...
   opatmmtool: Transmitting firmware
   opatmmtool: Successfully transmitted firmware to device
   opatmmtool: Firmware transmitted, wait for device ready
   Current Firmware Version=10.2.1.0.3
   Firmware Update Completed
  • restart TMM micro-controller
   # opatmmtool reboot
  • Reboot host & verify firmware version
   # hfi1_eprom -V -b 
   # hfi1_eprom -V -o
   # hfi1_eprom -V -c
   # opatmmtool -fwversion

OmniPath Firmware Matrix

For reference Omnipath HFI firmware versions documented in Intel OPA release notes:

| OPA version | Date | HFI UEFI fw | HFI TMM fw | Source document |
| --- | --- | --- | --- | --- |
| 10.3.1 | 2017 Feb | 1.3.2.0.0 | 10.2.1.0.3 | Intel_OP_Software_RN_J52019_v1_0.pdf |
| 10.3.2 | 2017 Sept | 1.3.2.0.0 | 10.2.1.0.3 | Intel_OP_Software_10_3_2_RN_J64261_v2_0.pdf |
| 10.4.1 | 2017 May | 1.4.0.0.0 | 10.4.0.0.146 | Intel_OP_Software_10_4_1_RN_J64255_v1_0.pdf |
| 10.4.2 | 2017 June | 1.4.2.0.0 | 10.4.0.0.146 | Intel_OP_Software_10_4_2_RN_J66909_v1_0.pdf |
| 10.5 | 2017 Sept | 1.5.2.0.0 | 10.4.0.0.146 | Intel_OP_Software_10_5_RN_J75208_v3_0.pdf |
| 10.6 | 2017 Oct | 1.6.0.0.0 | 10.4.0.0.146 | Intel_OP_Software_10_6_RN_J82662_v1_0.pdf |

[Optional] Lustre rpm update

NOTE: Ensure the rpm packages have been approved by DDN engineering

NOTE: Recommended to use pre-built rpm packages available from DDN engineering. The latest ES 3.2.0 rpm packages can be obtained from EXAScaler jenkins or ES Jenkins downstream build (DDN VPN access required)

ES 3.2.0 rpm upgrade list: kmod-lustre-common, kmod-lustre-el7.3, kmod-lustre-el7.3-ldiskfs, kmod-lustre-el7.3-mlnx3.4-o2ib-mlnx, kmod-lustre-el7.3-osd-ldiskfs, lustre, lustre-devel, lustre-iokit, lustre-osd-ldiskfs-mount, lustre-server, lustre-source, lustre-tests

eg) upgrade ES 3.2.0 (lustre-2.7.21.3-18.ddn8) to lustre-2.7.21.3-90.ddn11
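
Before running the upgrade, it can help to record the currently installed Lustre packages on every server (a standard rpm query, not part of the original plan):

# clush -b -a "rpm -qa | grep lustre | sort"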

Example

# yum upgrade  kmod-lustre-common.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \
kmod-lustre-el7.3.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \
kmod-lustre-el7.3-ldiskfs.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \
kmod-lustre-el7.3-mlnx3.4-o2ib-mlnx.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \
kmod-lustre-el7.3-osd-ldiskfs.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \
lustre.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \
lustre-devel.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \
lustre-iokit.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \
lustre-osd-ldiskfs-mount.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \
lustre-server.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \
lustre-source.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm \                   
lustre-tests.x86_64 0:2.7.21.3-90.ddn11.g83d7061.el7.rpm
  • ES 3.2.0 kmod-lustre & lustre upgrade list:

    kmod-lustre-common, kmod-lustre-el7.3, kmod-lustre-el7.3-ldiskfs, kmod-lustre-el7.3-mlnx3.4-o2ib-mlnx, kmod-lustre-el7.3-osd-ldiskfs, lustre, lustre-devel, lustre-iokit, lustre-osd-ldiskfs-mount, lustre-server, lustre-source, lustre-tests

    Example update to ddn-lustre 2.7.21.3.ddn25

# yum upgrade kmod-lustre-* lustre-*

...
Resolving Dependencies
--> Running transaction check
---> Package kmod-lustre-common.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package kmod-lustre-common.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package kmod-lustre-el7.4.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package kmod-lustre-el7.4.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package kmod-lustre-el7.4-ldiskfs.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package kmod-lustre-el7.4-ldiskfs.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package kmod-lustre-el7.4-mlnx4.3-o2ib-mlnx.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package kmod-lustre-el7.4-mlnx4.3-o2ib-mlnx.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package kmod-lustre-el7.4-osd-ldiskfs.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package kmod-lustre-el7.4-osd-ldiskfs.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package lustre.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package lustre.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package lustre-devel.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package lustre-devel.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package lustre-iokit.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package lustre-iokit.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package lustre-osd-ldiskfs-mount.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package lustre-osd-ldiskfs-mount.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package lustre-server.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package lustre-server.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package lustre-source.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package lustre-source.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
---> Package lustre-tests.x86_64 0:2.7.21.3-256.ddn20.g10dd357.el7 will be updated
---> Package lustre-tests.x86_64 0:2.7.21.3-272.ddn25.g9b5a642.el7 will be an update
--> Finished Dependency Resolution
