I need to extract article bodies from raw HTML documents. My code is as simple as:
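from boilerpipe.extract import Extractor

for html in htmls:
    # Pass the loop variable; the original passed an undefined name `article`
    extractor = Extractor(extractor='ArticleExtractor', html=html)
    extractor.getHTML()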
After calling one of its methods many times (e.g. 10K times), I get a java.lang.OutOfMemoryError:
Traceback (most recent call last):
  File "test.py", line 228, in <module>
    extractor.getHTML()
  File "/Users/macuser/.virtualenvs/bro/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 70, in getHTML
    return highlighter.process(self.source, self.data)
jpype._jexception.OutOfMemoryErrorPyRaisable: java.lang.OutOfMemoryError: Java heap space
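To pin down how many calls succeed before the heap runs out, a small counting loop may help. This is only a sketch: `count_calls_until_failure` and the `extract` callable are hypothetical names, with `extract` standing in for whatever builds an `Extractor` and calls `getHTML()` on it.

```python
def count_calls_until_failure(htmls, extract):
    """Call extract(html) per document; return (calls_succeeded, error_or_None).

    `extract` is a hypothetical callable, e.g. one that wraps
    Extractor(extractor='ArticleExtractor', html=html).getHTML().
    """
    succeeded = 0
    for html in htmls:
        try:
            extract(html)
        except Exception as exc:
            # Assumption: the Java error reaches Python as an ordinary
            # exception, as the traceback above suggests.
            return succeeded, exc
        succeeded += 1
    return succeeded, None
```

Running it with the same input at a few different heap sizes should show whether the number of successful calls grows with the heap, which would point at objects accumulating across calls rather than at one oversized document.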
I looked into the code, and it looks like creating BoilerpipeSAXInput, HTMLHighlighter, and other Java instances on every call causes this problem. Is there a way to fix this issue?
To reproduce this without 10K articles, reduce the JVM heap size in boilerpipe.__init__, for example to 4 MB:
MAX_JVM_HEAP_SIZE_MBYTES = 4
if jpype.isJVMStarted() != True:
    jars = []
    for top, dirs, files in os.walk(imp.find_module('boilerpipe')[1]+'/data'):
        for nm in files:
            jars.append(os.path.join(top, nm))
    jvm_args = [
        '-Xmx%dM' % MAX_JVM_HEAP_SIZE_MBYTES,
        "-Djava.class.path=%s" % os.pathsep.join(jars)
    ]
    jpype.startJVM(jpype.getDefaultJVMPath(), *jvm_args)
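For trying several heap sizes without editing the installed package each time, the argument construction above could be pulled into a helper. This is a minimal sketch that mirrors the snippet; `build_jvm_args` and its parameters are hypothetical names, not part of boilerpipe, and the sorting of file names is added here only to make the result deterministic.

```python
import os


def build_jvm_args(jar_dir, max_heap_mbytes):
    """Build JVM arguments the same way the snippet above does.

    Hypothetical helper: collects every file under `jar_dir` into the
    class path and caps the heap at `max_heap_mbytes` megabytes.
    """
    jars = []
    for top, dirs, files in os.walk(jar_dir):
        for nm in sorted(files):
            jars.append(os.path.join(top, nm))
    return [
        '-Xmx%dM' % max_heap_mbytes,
        '-Djava.class.path=%s' % os.pathsep.join(jars),
    ]
```

The returned list would be passed as `jpype.startJVM(jpype.getDefaultJVMPath(), *args)`, as in the snippet, before any extractor is created, since the JVM is only started once per process.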