java.lang.OutOfMemoryError: Java heap space after multiple getHTML calls #29

Open
alibozorgkhan opened this issue Dec 31, 2014 · 1 comment

@alibozorgkhan

I need to extract article bodies from raw HTML documents. My code is as simple as:

for html in htmls:
    extractor = Extractor(extractor='ArticleExtractor', html=html)
    extractor.getHTML()

After calling getHTML() many times (around 10K), I get a java.lang.OutOfMemoryError:

Traceback (most recent call last):
  File "test.py", line 228, in <module>
    extractor.getHTML()
  File "/Users/macuser/.virtualenvs/bro/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 70, in getHTML
    return highlighter.process(self.source, self.data)
jpype._jexception.OutOfMemoryErrorPyRaisable: java.lang.OutOfMemoryError: Java heap space

I looked into the code, and it looks like creating BoilerpipeSAXInput, HTMLHighlighter and other Java instances on every call causes this problem. Is there a way to fix this issue?

To reproduce this without 10K articles, simply reduce the heap size in boilerpipe.__init__:

MAX_JVM_HEAP_SIZE_MBYTES = 4  # reduced so the error shows up quickly

if not jpype.isJVMStarted():
    jars = []
    for top, dirs, files in os.walk(imp.find_module('boilerpipe')[1]+'/data'):
        for nm in files:
            jars.append(os.path.join(top, nm))

    jvm_args = [
        '-Xmx%dM' % MAX_JVM_HEAP_SIZE_MBYTES,
        "-Djava.class.path=%s" % os.pathsep.join(jars)
    ]
    jpype.startJVM(jpype.getDefaultJVMPath(), *jvm_args)
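To try different heap caps without editing the constant each time, the two JVM options above can be built by a small helper. This is only a sketch: build_jvm_args is a hypothetical name, not part of boilerpipe, and the jar names in the example are placeholders. A larger -Xmx value would presumably only postpone the error rather than fix the underlying growth.

```python
import os


def build_jvm_args(heap_mbytes, jars):
    """Build the same two JVM options as the snippet above.

    Hypothetical helper (not part of boilerpipe): heap_mbytes is the
    -Xmx cap in megabytes, jars is a list of jar file paths.
    """
    return [
        '-Xmx%dM' % heap_mbytes,
        '-Djava.class.path=%s' % os.pathsep.join(jars),
    ]


# Placeholder jar names; in boilerpipe these come from the os.walk loop.
small_heap_args = build_jvm_args(4, ['a.jar', 'b.jar'])
```

The resulting list would be passed as *jvm_args to jpype.startJVM, exactly as in the snippet above.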
@joelthchao

Same issue here, after about 5k extractions from raw HTML.
