inflate() failed with error -3: incorrect header check #49

Open
RumitAP opened this issue Jun 7, 2022 · 5 comments

@RumitAP

RumitAP commented Jun 7, 2022

Background:

I am trying to test whether GPU-intensive ML programs can run faster or more cheaply on a distributed HPC platform, with the job spread over multiple CPU nodes, than on an A100. I am running the CosmoFlow TensorFlow Keras benchmark implementation on the v0.7 small dataset.

Command used to run the job locally:

python3 train.py --data-dir cosmoUniverse_2019_05_4parE_tf_small --n-train 32 --n-valid 32 --batch-size 2

Error Log:

2022-06-07 12:58:00.404289: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-06-07 12:58:00.407257: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-06-07 12:58:00.407270: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-06-07 12:58:01,359 INFO Initialized rank 0 size 1 local_rank 0 local_size 1
2022-06-07 12:58:01,359 INFO Configuration: {'output_dir': 'results/cosmo-002', 'mlperf': {'org': 'LBNL', 'division': 'closed', 'status': 'onprem', 'platform': 'SUBMISSION_PLATFORM_PLACEHOLDER'}, 'data': {'name': 'cosmo', 'data_dir': 'cosmoUniverse_2019_05_4parE_tf_small', 'compression': 'GZIP', 'n_train': 32, 'n_valid': 32, 'sample_shape': [128, 128, 128, 4], 'batch_size': 2, 'n_epochs': 128, 'shard': True, 'apply_log': True, 'prefetch': 4}, 'model': {'name': 'cosmoflow', 'input_shape': [128, 128, 128, 4], 'kernel_size': 3, 'target_size': 4, 'conv_size': 32, 'fc1_size': 128, 'fc2_size': 64, 'hidden_activation': 'LeakyReLU', 'pooling_type': 'MaxPool3D', 'dropout': 0.5}, 'optimizer': {'name': 'SGD', 'momentum': 0.9}, 'lr_schedule': {'base_lr': 0.001, 'scaling': 'linear', 'base_batch_size': 64, 'n_warmup_epochs': 4, 'decay_schedule': {32: 0.25, 64: 0.125}}, 'train': {'loss': 'mse', 'metrics': ['mean_absolute_error']}}
2022-06-07 12:58:01,359 INFO KMP_BLOCKTIME 
2022-06-07 12:58:01,360 INFO KMP_AFFINITY 
2022-06-07 12:58:01,360 INFO OMP_NUM_THREADS 
2022-06-07 12:58:01,360 INFO INTRA_THREADS 32
2022-06-07 12:58:01,360 INFO INTER_THREADS 2
2022-06-07 12:58:01,360 INFO Loading data
:::MLLOG {"namespace": "", "time_ms": 1654621081360, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 2, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/data/cosmo.py", "lineno": 162}}
:::MLLOG {"namespace": "", "time_ms": 1654621081462, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 32, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/data/cosmo.py", "lineno": 163}}
:::MLLOG {"namespace": "", "time_ms": 1654621081462, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 32, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/data/cosmo.py", "lineno": 164}}
2022-06-07 12:58:01.463083: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-06-07 12:58:01.463098: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-06-07 12:58:01.463111: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ThinkPad): /proc/driver/nvidia/version does not exist
2022-06-07 12:58:01.463329: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
:::MLLOG {"namespace": "", "time_ms": 1654621081466, "event_type": "INTERVAL_START", "key": "staging_start", "value": null, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/data/cosmo.py", "lineno": 172}}
:::MLLOG {"namespace": "", "time_ms": 1654621081466, "event_type": "INTERVAL_END", "key": "staging_stop", "value": null, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/data/cosmo.py", "lineno": 191}}
2022-06-07 12:58:01,548 INFO Splitting data into 1 worker shards
2022-06-07 12:58:01,548 INFO Each worker reading 32 training samples and 32 validation samples
2022-06-07 12:58:01,548 INFO Data setting n_train: 32
2022-06-07 12:58:01,548 INFO Data setting n_valid: 32
2022-06-07 12:58:01,548 INFO Data setting batch_size: 2
2022-06-07 12:58:01,548 INFO Data setting compression: GZIP
2022-06-07 12:58:01,548 INFO Data setting prefetch: 4
2022-06-07 12:58:01,548 INFO Building the model
:::MLLOG {"namespace": "", "time_ms": 1654621081549, "event_type": "POINT_IN_TIME", "key": "opt_weight_decay", "value": 0, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/models/cosmoflow.py", "lineno": 53}}
:::MLLOG {"namespace": "", "time_ms": 1654621081551, "event_type": "POINT_IN_TIME", "key": "dropout", "value": 0.5, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/models/cosmoflow.py", "lineno": 54}}
:::MLLOG {"namespace": "", "time_ms": 1654621081647, "event_type": "POINT_IN_TIME", "key": "opt_name", "value": "SGD", "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/utils/optimizers.py", "lineno": 109}}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv3d (Conv3D)             (None, 128, 128, 128, 32  3488      
                             )                                   
                                                                 
 leaky_re_lu (LeakyReLU)     (None, 128, 128, 128, 32  0         
                             )                                   
                                                                 
 max_pooling3d (MaxPooling3D  (None, 64, 64, 64, 32)   0         
 )                                                               
                                                                 
 conv3d_1 (Conv3D)           (None, 64, 64, 64, 64)    55360     
                                                                 
 leaky_re_lu_1 (LeakyReLU)   (None, 64, 64, 64, 64)    0         
                                                                 
 max_pooling3d_1 (MaxPooling  (None, 32, 32, 32, 64)   0         
 3D)                                                             
                                                                 
 conv3d_2 (Conv3D)           (None, 32, 32, 32, 128)   221312    
                                                                 
 leaky_re_lu_2 (LeakyReLU)   (None, 32, 32, 32, 128)   0         
                                                                 
 max_pooling3d_2 (MaxPooling  (None, 16, 16, 16, 128)  0         
 3D)                                                             
                                                                 
 conv3d_3 (Conv3D)           (None, 16, 16, 16, 256)   884992    
                                                                 
 leaky_re_lu_3 (LeakyReLU)   (None, 16, 16, 16, 256)   0         
                                                                 
 max_pooling3d_3 (MaxPooling  (None, 8, 8, 8, 256)     0         
 3D)                                                             
                                                                 
 conv3d_4 (Conv3D)           (None, 8, 8, 8, 512)      3539456   
                                                                 
 leaky_re_lu_4 (LeakyReLU)   (None, 8, 8, 8, 512)      0         
                                                                 
 max_pooling3d_4 (MaxPooling  (None, 4, 4, 4, 512)     0         
 3D)                                                             
                                                                 
 flatten (Flatten)           (None, 32768)             0         
                                                                 
 dense (Dense)               (None, 128)               4194432   
                                                                 
 leaky_re_lu_5 (LeakyReLU)   (None, 128)               0         
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 leaky_re_lu_6 (LeakyReLU)   (None, 64)                0         
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_2 (Dense)             (None, 4)                 260       
                                                                 
 lambda (Lambda)             (None, 4)                 0         
                                                                 
=================================================================
Total params: 8,907,556
Trainable params: 8,907,556
Non-trainable params: 0
_________________________________________________________________
2022-06-07 12:58:01,652 INFO Writing config via pickle to results/cosmo-002/config.pkl
2022-06-07 12:58:01,652 INFO Preparing callbacks
:::MLLOG {"namespace": "", "time_ms": 1654621081652, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 3.125e-05, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/utils/optimizers.py", "lineno": 91}}
:::MLLOG {"namespace": "", "time_ms": 1654621081652, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_epochs", "value": 4, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/utils/optimizers.py", "lineno": 92}}
:::MLLOG {"namespace": "", "time_ms": 1654621081652, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_factor", "value": 0.03125, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/utils/optimizers.py", "lineno": 93}}
:::MLLOG {"namespace": "", "time_ms": 1654621081652, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_decay_boundary_epochs", "value": [32, 64], "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/utils/optimizers.py", "lineno": 94}}
:::MLLOG {"namespace": "", "time_ms": 1654621081652, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_decay_factor", "value": 0.25, "metadata": {"file": "/home/rap/bin/cosmoflow-benchmark/utils/optimizers.py", "lineno": 96}}
2022-06-07 12:58:01,652 INFO Beginning training
Epoch 1/128
Traceback (most recent call last):
  File "/home/rap/bin/cosmoflow-benchmark/train.py", line 395, in <module>
    main()
  File "/home/rap/bin/cosmoflow-benchmark/train.py", line 367, in main
    model.fit(datasets['train_dataset'],
  File "/home/rap/.local/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/rap/.local/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.DataLossError: Graph execution error:

Detected at node 'IteratorGetNext' defined at (most recent call last):
    File "/home/rap/bin/cosmoflow-benchmark/train.py", line 395, in <module>
      main()
    File "/home/rap/bin/cosmoflow-benchmark/train.py", line 367, in main
      model.fit(datasets['train_dataset'],
    File "/home/rap/.local/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/rap/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1409, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/rap/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1051, in train_function
      return step_function(self, iterator)
    File "/home/rap/.local/lib/python3.10/site-packages/keras/engine/training.py", line 1039, in step_function
      data = next(iterator)
Node: 'IteratorGetNext'
inflate() failed with error -3: incorrect header check
	 [[{{node IteratorGetNext}}]] [Op:__inference_train_function_1195]
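
A note on the message itself: `incorrect header check` with error -3 is what zlib reports (`Z_DATA_ERROR`) when the first bytes of a stream are not a valid compressed-stream header. The configuration printed above has `'compression': 'GZIP'`, so one thing worth ruling out is that the record files on disk are not gzip-compressed at all. The following is a minimal sketch under that assumption, using only the Python standard library; `looks_gzipped`, `gzip_report`, and `inflate_error` are hypothetical helper names, not part of the benchmark code.

```python
import zlib
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gzip_report(data_dir):
    """Map every regular file under `data_dir` to True/False (gzip magic present)."""
    return {
        str(p): looks_gzipped(p)
        for p in sorted(Path(data_dir).rglob("*"))
        if p.is_file()
    }


def inflate_error(raw):
    """Try to gunzip `raw`; return zlib's error text, or None on success."""
    try:
        zlib.decompress(raw, wbits=31)  # wbits=31 means "expect a gzip header"
    except zlib.error as exc:
        return str(exc)
    return None
```

Calling `gzip_report("cosmoUniverse_2019_05_4parE_tf_small")` would show which files, if any, lack the gzip header. If they all come back `False`, the compression setting and the data format probably disagree; if they come back `True`, the cause lies elsewhere (for example a truncated or damaged file).
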
@sparticlesteve
Owner

Integrity of the downloaded data is probably a good first thing to check. Can you confirm how you downloaded the data? I can provide a checksum for the tarball, though given its small size you could even just feasibly try to download it again.
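
For the checksum comparison, a sketch like the one below could be used to hash the downloaded tarball locally. No reference checksum appears in this thread, and SHA-256 is only an assumed algorithm here; whichever digest the maintainer publishes would determine the `hashlib` constructor to use. `sha256_of` is a hypothetical helper name.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat regardless of the tarball size. The result would be compared by eye, or with `==`, against the published value.
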

@sparticlesteve
Owner

Also, what's your software stack? That code is from September 2020, so there may be issues if you're trying to run with newer versions of TF, etc.
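
One way to capture the software stack for a report like this is to read the installed package versions from the environment's metadata. The sketch below uses only the standard library, so it works even when importing TensorFlow itself fails. The package list is an assumption (a CPU-only or vendor build may be installed under a different distribution name), and `stack_versions` is a hypothetical helper name.

```python
import platform
from importlib import metadata


def stack_versions(packages=("tensorflow", "keras", "horovod", "numpy")):
    """Return {name: version or None} for `packages`, plus the Python version."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed, or installed under another distribution name.
            versions[name] = None
    return versions
```

Printing `stack_versions()` inside the same environment (or container) that runs `train.py` gives the values to paste into the issue.
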

@RumitAP
Author

RumitAP commented Jun 7, 2022

> Integrity of the downloaded data is probably a good first thing to check. Can you confirm how you downloaded the data? I can provide a checksum for the tarball, though given its small size you could even just feasibly try to download it again.

I used wget to retrieve the data from portal.nersc.gov. Should I be retrieving the data another way?

> Also, what's your software stack? That code is from September 2020, so there may be issues if you're trying to run with newer versions of TF, etc.

The TensorFlow version I am using is 2.9.1. Is there a specific version of TensorFlow that you would recommend?

I am currently trying to get this to run on my local machine before I run it single-node or multi-node on our HPC platform. On the HPC platform it will run in a Docker image (Ubuntu base with the required packages installed).

@RumitAP
Author

RumitAP commented Jun 8, 2022

I also tried re-downloading the data but ran into the same issue.

Here is the image I am using on my local system:

FROM ubuntu:focal

RUN apt-get update -y

RUN apt-get -y install libmkl-avx2 python3-pip git wget

RUN mkdir -p /usr/local/src
WORKDIR /usr/local/src

RUN pip install tensorflow wandb

RUN git clone --recursive https://github.com/uber/horovod; cd horovod; python3 setup.py sdist

RUN pip install horovod/dist/horovod-0.24.3.tar.gz

RUN git clone https://github.com/mlperf/logging.git mlperf-logging

RUN pip install -e mlperf-logging

RUN  git clone https://github.com/sparticlesteve/cosmoflow-benchmark.git

RUN git clone https://github.com/azrael417/mlperf-deepcam.git

WORKDIR /usr/local/src/cosmoflow-benchmark

CMD wget https://portal.nersc.gov/project/dasrepo/cosmoflow-benchmark/cosmoUniverse_2019_05_4parE_tf_v2.tar; tar -xvf cosmoUniverse_2019_05_4parE_tf_v2.tar

@RumitAP
Author

RumitAP commented Jun 13, 2022

Any thoughts, @sparticlesteve? I am still running into the same problem.
