
Mango Build Error #447

Open
ssabnis opened this issue Oct 12, 2018 · 15 comments
@ssabnis

ssabnis commented Oct 12, 2018

Hello,

I am new to genomics projects.
I am running Mango and encountering build errors. Any help is greatly appreciated.

I have the following setup:

Package Versions:

- Python 2.7.5
- java version "1.8.0_171"
- Scala code runner version 2.11.12
- Hadoop 3.1.0
- Spark 2.3.1
- npm 3.10.10

My .bashrc file entries

export JAVA_HOME=/usr
export SPARK_HOME=/opt/spark/spark-2.3.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH

ASSEMBLY_DIR=/home/hadoop/mango/mango-assembly/target
ASSEMBLY_JAR="$(ls -1 "$ASSEMBLY_DIR" | grep "^mango-assembly[0-9A-Za-z\_\.-]*\.jar$" | grep -v javadoc | grep -v sources || true)"
export PYSPARK_SUBMIT_ARGS="--jars ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} --driver-class-path ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} pyspark-shell"

Command: mvn package -P python

BUILD ERROR

bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_coverage_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_fragment_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_indel_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_indel_distribution_maximal_bin_size FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_indel_distribution_no_elements FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_mapq_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_visualize_alignments FAILED
bdgenomics/mango/test/coverage_test.py::CoverageTest::test_coverage_distribution FAILED
bdgenomics/mango/test/coverage_test.py::CoverageTest::test_example_coverage FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_cumulative_count_distribution FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_fail_on_invalid_sample FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_normalized_count_distribution FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_sampling FAILED
bdgenomics/mango/test/feature_test.py::FeatureTest::test_visualize_features FAILED
bdgenomics/mango/test/notebook_test.py::NotebookTest::test_alignment_example FAILED
bdgenomics/mango/test/notebook_test.py::NotebookTest::test_coverage_example FAILED
bdgenomics/mango/test/notebook_test.py::NotebookTest::test_example FAILED
bdgenomics/mango/test/variant_test.py::VariantTest::test_visualize_variants FAILED

=================================== FAILURES ===================================
___________________ AlignmentTest.test_coverage_distribution ___________________
bdgenomics/mango/test/__init__.py:65: in setUp
    self.ss = SparkSession.builder.master('local[4]').appName(class_name).getOrCreate()
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/session.py:173: in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/context.py:343: in getOrCreate
    SparkContext(conf=conf or SparkConf())
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/context.py:115: in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/context.py:292: in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

conf = <pyspark.conf.SparkConf object at 0x7f8fcdb7fd10>

    def launch_gateway(conf=None):
        """
        launch jvm gateway
        :param conf: spark configuration passed to spark-submit
        :return:
        """
        if "PYSPARK_GATEWAY_PORT" in os.environ:
            gateway_port = int(os.environ["PYSPARK_GATEWAY_PORT"])
            gateway_secret = os.environ["PYSPARK_GATEWAY_SECRET"]
        else:
            SPARK_HOME = _find_spark_home()
            # Launch the Py4j gateway using Spark's run command so that we pick up the
            # proper classpath and settings from spark-env.sh
            on_windows = platform.system() == "Windows"
            script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
            command = [os.path.join(SPARK_HOME, script)]
            if conf:
                for k, v in conf.getAll():
                    command += ['--conf', '%s=%s' % (k, v)]
            submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
            if os.environ.get("SPARK_TESTING"):
                submit_args = ' '.join([
                    "--conf spark.ui.enabled=false",
                    submit_args
                ])
            command = command + shlex.split(submit_args)

            # Create a temporary directory where the gateway server should write the connection
            # information.
            conn_info_dir = tempfile.mkdtemp()
            try:
                fd, conn_info_file = tempfile.mkstemp(dir=conn_info_dir)
                os.close(fd)
                os.unlink(conn_info_file)

                env = dict(os.environ)
                env["_PYSPARK_DRIVER_CONN_INFO_PATH"] = conn_info_file

                # Launch the Java gateway.
                # We open a pipe to stdin so that the Java gateway can die when the pipe is broken
                if not on_windows:
                    # Don't send ctrl-c / SIGINT to the Java gateway:
                    def preexec_func():
                        signal.signal(signal.SIGINT, signal.SIG_IGN)
                    proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)
                else:
                    # preexec_fn not supported on Windows
                    proc = Popen(command, stdin=PIPE, env=env)

                # Wait for the file to appear, or for the process to exit, whichever happens first.
                while not proc.poll() and not os.path.isfile(conn_info_file):
                    time.sleep(0.1)

                if not os.path.isfile(conn_info_file):
                   raise Exception("Java gateway process exited before sending its port number")
                  Exception: Java gateway process exited before sending its port number

/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/java_gateway.py:93: Exception
----------------------------- Captured stderr call -----------------------------
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/mango/mango-assembly/target/mango-assembly-0.0.2-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/spark-2.3.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-10-12 11:20:03 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
        at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
        at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
        at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
@akmorrow13
Contributor

Hi @ssabnis! Have you run make prepare from the mango-python directory? Also, are you running in a virtual environment?

@ssabnis
Author

ssabnis commented Oct 12, 2018

@akmorrow13 thanks for the quick reply. I did run make prepare, and I am not running in a virtual environment.

@akmorrow13
Contributor

I think this is a Spark versioning issue. You are using Spark 2.3.1, but Mango is pre-built for Spark 2.2.1. More specifically, Spark 2.3.1 uses a new version of py4j (0.10.7) that removed _PYSPARK_DRIVER_CALLBACK_HOST, whereas Spark 2.2.1 uses py4j 0.10.4. To fix this, try updating the Mango pom to your installed Hadoop and Spark versions and recompiling.
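
For example, a minimal sketch of the relevant property changes (property names as they appear in the Mango pom; the values here are assumptions matching your installation):

  <properties>
    <hadoop.version>3.1.0</hadoop.version>
    <spark.version>2.3.1</spark.version>
  </properties>

followed by a rebuild with mvn clean package -P python.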

@ssabnis
Author

ssabnis commented Oct 12, 2018

@akmorrow13 thank you. I also came across ./scripts/move_to_spark2.sh. Is it required to run in order to use Spark 2.3.1? I will update the pom and test it again.

@ssabnis
Author

ssabnis commented Oct 12, 2018

@akmorrow13 it looks like Hadoop 3.1.0, Spark 2.3.1, and Parquet 1.8.2 have a compatibility issue; I get a different error now:

java.lang.NoSuchMethodError: org.apache.parquet.column.statistics.Statistics.getBuilderForReading(Lorg/apache/parquet/schema/PrimitiveType$PrimitiveTypeName;)Lorg/apache/parquet/column/statistics/Statistics$Builder;
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:340)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:365)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:821)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:798)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:484)
        at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:568)
        at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:492)
        at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:166)
        at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:186)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)

@akmorrow13
Contributor

Spark 2.3.1 uses Parquet 1.10.0 https://github.com/apache/spark/blob/master/pom.xml#L132, so you would have to change to this in the Mango pom as well.
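
Concretely, that would be a one-property change along these lines (a sketch; the property name matches the pom posted below):

    <parquet.version>1.10.0</parquet.version>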

Just a warning, Mango has not been tested yet with these newer versions.

@ssabnis
Author

ssabnis commented Oct 12, 2018

@akmorrow13 you are right; now I get a BROTLI codec error:

java.lang.NoSuchFieldError: BROTLI
        at org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:821)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:798)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:484)
        at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:568)
        at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:492)
        at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:166)
        at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:186)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)

pom.xml entries

  <properties>
    <adam.version>0.24.0</adam.version>
    <avro.version>1.8.1</avro.version>
    <bdg-formats.version>0.11.3</bdg-formats.version>
    <bdg-utils.version>0.2.13</bdg-utils.version>
    <convert.version>0.3.0</convert.version>
    <java.version>1.8</java.version>
    <jetty.version>9.2.17.v20160517</jetty.version>
    <ga4gh.version>0.6.0a10</ga4gh.version>
    <hadoop.version>3.1.0</hadoop.version>
    <hadoop-bam.version>7.9.2</hadoop-bam.version>
    <htsjdk.version>2.9.1</htsjdk.version>
    <parquet.version>1.10.0</parquet.version>
    <scala.version>2.11.12</scala.version>
    <scala.version.prefix>2.11</scala.version.prefix>
    <scalatra.version>2.4.1</scalatra.version>
    <spark.version>2.3.1</spark.version>
    <spark.version.prefix>-spark2_</spark.version.prefix>
    <snappy.version>1.0.5</snappy.version>
    <scoverage.plugin.version>1.1.1</scoverage.plugin.version>
    <protobuf.version>3.0.0-beta-3</protobuf.version>
  </properties>

@ssabnis
Author

ssabnis commented Oct 12, 2018

@akmorrow13 I have a BAM file in HDFS that I need to visualize using Mango. Any suggestions for getting past this issue and making the Mango UI work?

@ssabnis
Author

ssabnis commented Oct 12, 2018

@akmorrow13 thanks for the help. I changed the Spark version to 2.2.1 and reconfigured, but I still get FAILED tests. Any clue? I am attaching the build output file. Thanks.
output.err2.zip

@akmorrow13
Contributor

@ssabnis can you please post the errors on GitHub? That is easiest for debugging and issue documentation.

self._jvm.org.bdgenomics.adam.rdd.ADAMContext.ADAMContextFromSession(ss._jsparkSession)
E       TypeError: 'JavaPackage' object is not callable

generally means that Python cannot find the jar file.

Make sure you have correctly set:

ASSEMBLY_DIR=/home/hadoop/mango/mango-assembly/target
ASSEMBLY_JAR="$(ls -1 "$ASSEMBLY_DIR" | grep "^mango-assembly[0-9A-Za-z\_\.-]*\.jar$" | grep -v javadoc | grep -v sources || true)"
export PYSPARK_SUBMIT_ARGS="--jars ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} --driver-class-path ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} pyspark-shell"

And that echo $ASSEMBLY_DIR/$ASSEMBLY_JAR correctly points to the compiled jar.
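
For example, a quick sanity check might look like this (a sketch; the 0.0.2-SNAPSHOT jar name is taken from the SLF4J output above and will vary with the version you built):

# Both commands should resolve to the compiled assembly jar, e.g.
# /home/hadoop/mango/mango-assembly/target/mango-assembly-0.0.2-SNAPSHOT.jar
echo "$ASSEMBLY_DIR/$ASSEMBLY_JAR"
ls -l "$ASSEMBLY_DIR/$ASSEMBLY_JAR"
# PYSPARK_SUBMIT_ARGS should include that same path for both --jars and --driver-class-path
echo "$PYSPARK_SUBMIT_ARGS"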

@ssabnis
Author

ssabnis commented Oct 14, 2018

@akmorrow13 I am able to compile now; I forgot to run mvn clean package before building the Python module. All good now. Thanks.

Is there a Mango submit command to use a BAM on my local HDFS/Spark? Any reference will help.

@ssabnis
Author

ssabnis commented Oct 16, 2018

What are the steps and tools to visualize genome BAM files?

@akmorrow13
Contributor

Please take a look at our readthedocs. Under usage and examples, there are both a Python-based and a browser-based tool that allow visualization of BAM files.
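
As a rough sketch of the Python route (assumed paths and a pre-existing SparkSession named spark; the ADAMContext call mirrors the test setup in the traceback above, and the Mango visualization widgets themselves are documented in the readthedocs usage section):

# Load a BAM from HDFS with ADAM, then hand the result to Mango's notebook widgets.
from bdgenomics.adam.adamContext import ADAMContext

ac = ADAMContext(spark)                                            # spark: existing SparkSession
alignments = ac.loadAlignments("hdfs:///user/hadoop/sample.bam")   # placeholder path
# From here, the bdgenomics.mango summary/visualization classes exercised by the
# tests earlier in this thread consume the loaded dataset; see readthedocs for details.

The browser-based route instead goes through Mango's mango-submit script; readthedocs covers the exact arguments for pointing it at a BAM.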

@ssabnis
Author

ssabnis commented Oct 16, 2018

@akmorrow13 one last question you may be able to help with: is there a large genome dataset that I can use with Mango to visualize? Any references will help.

@akmorrow13
Contributor

@ssabnis one free dataset that you can access is the 1000 Genomes dataset. If you are running on AWS, it is hosted there. You can see Mango's AWS notebook tutorial, which accesses these files. Instructions for running on AWS can be found on readthedocs.
