Py4JJavaError: An error occurred while calling lemmatizer = LemmatizerModel.pretrained() #5774

Closed
dkaenzig opened this issue Jul 5, 2021 · 4 comments

@dkaenzig

dkaenzig commented Jul 5, 2021

Hi, I am new to spark-nlp. As my first project, I tried to replicate the analysis here: https://towardsdatascience.com/natural-language-processing-with-pyspark-and-spark-nlp-b5b29f8faba. I was able to set up Spark following the instructions here: https://phoenixnap.com/kb/install-spark-on-windows-10, and I can run Spark in a Python Jupyter notebook. However, when I try to load a pretrained model, e.g. lemmatizer = LemmatizerModel.pretrained(), I run into errors. Other tasks, e.g. loading a .parquet file, work fine.

Description

My code is the following:

import sparknlp
sparknlp.start() 

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Spark NLP') \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.3.5") \
    .getOrCreate()

spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

from sparknlp.base import Finisher, DocumentAssembler
from sparknlp.annotator import (Tokenizer, Normalizer, 
                                LemmatizerModel, StopWordsCleaner)
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline 

lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(['normalized']) \
    .setOutputCol('lemma') 

Unfortunately, this does not work as expected. I run into the following error:

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]

---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-16-5f238762f2fd> in <module>
     18 #lemmatizer = LemmatizerModel.load('lemma_nl_2.5.0_2.4_1588532720582')\
     19 #lemmatizer = LemmatizerModel().load("lemma_nl_2.5.0_2.4_1588532720582")\
---> 20 lemmatizer = LemmatizerModel.pretrained() \
     21     .setInputCols(['normalized']) \
     22     .setOutputCol('lemma') \

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\annotator.py in pretrained(name, lang, remote_loc)
    686     def pretrained(name="lemma_antbnc", lang="en", remote_loc=None):
    687         from sparknlp.pretrained import ResourceDownloader
--> 688         return ResourceDownloader.downloadModel(LemmatizerModel, name, lang, remote_loc)
    689 
    690 

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
     55             t1.start()
     56             try:
---> 57                 j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
     58             except Py4JJavaError as e:
     59                 sys.stdout.write("\n" + str(e))

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\internal.py in __init__(self, reader, name, language, remote_loc, validator)
    189 class _DownloadModel(ExtendedJavaWrapper):
    190     def __init__(self, reader, name, language, remote_loc, validator):
--> 191         super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained."+validator+".downloadModel", reader, name, language, remote_loc)
    192 
    193 

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\internal.py in __init__(self, java_obj, *args)
    142         super(ExtendedJavaWrapper, self).__init__(java_obj)
    143         self.sc = SparkContext._active_spark_context
--> 144         self._java_obj = self.new_java_obj(java_obj, *args)
    145         self.java_obj = self._java_obj
    146 

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\internal.py in new_java_obj(self, java_class, *args)
    152 
    153     def new_java_obj(self, java_class, *args):
--> 154         return self._new_java_obj(java_class, *args)
    155 
    156     def new_java_array(self, pylist, java_class):

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\ml\wrapper.py in _new_java_obj(java_class, *args)
     64             java_obj = getattr(java_obj, name)
     65         java_args = [_py2java(sc, arg) for arg in args]
---> 66         return java_obj(*java_args)
     67 
     68     @staticmethod

~\miniconda3\envs\nlpspark\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

IllegalArgumentException: requirement failed: Was not found appropriate resource to download for request: ResourceRequest(lemma_antbnc,Some(en),public/models,3.1.1,3.1.2) with downloader: com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader@1c77ab69

Interestingly, when I run it a second time, I still get an error, but the error changes:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-17-5f238762f2fd> in <module>
     18 #lemmatizer = LemmatizerModel.load('lemma_nl_2.5.0_2.4_1588532720582')\
     19 #lemmatizer = LemmatizerModel().load("lemma_nl_2.5.0_2.4_1588532720582")\
---> 20 lemmatizer = LemmatizerModel.pretrained() \
     21     .setInputCols(['normalized']) \
     22     .setOutputCol('lemma') \

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\annotator.py in pretrained(name, lang, remote_loc)
    686     def pretrained(name="lemma_antbnc", lang="en", remote_loc=None):
    687         from sparknlp.pretrained import ResourceDownloader
--> 688         return ResourceDownloader.downloadModel(LemmatizerModel, name, lang, remote_loc)
    689 
    690 

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
     58             except Py4JJavaError as e:
     59                 sys.stdout.write("\n" + str(e))
---> 60                 raise e
     61             finally:
     62                 stop_threads = True

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
     55             t1.start()
     56             try:
---> 57                 j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
     58             except Py4JJavaError as e:
     59                 sys.stdout.write("\n" + str(e))

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\internal.py in __init__(self, reader, name, language, remote_loc, validator)
    189 class _DownloadModel(ExtendedJavaWrapper):
    190     def __init__(self, reader, name, language, remote_loc, validator):
--> 191         super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained."+validator+".downloadModel", reader, name, language, remote_loc)
    192 
    193 

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\internal.py in __init__(self, java_obj, *args)
    142         super(ExtendedJavaWrapper, self).__init__(java_obj)
    143         self.sc = SparkContext._active_spark_context
--> 144         self._java_obj = self.new_java_obj(java_obj, *args)
    145         self.java_obj = self._java_obj
    146 

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\internal.py in new_java_obj(self, java_class, *args)
    152 
    153     def new_java_obj(self, java_class, *args):
--> 154         return self._new_java_obj(java_class, *args)
    155 
    156     def new_java_array(self, pylist, java_class):

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\ml\wrapper.py in _new_java_obj(java_class, *args)
     64             java_obj = getattr(java_obj, name)
     65         java_args = [_py2java(sc, arg) for arg in args]
---> 66         return java_obj(*java_args)
     67 
     68     @staticmethod

~\miniconda3\envs\nlpspark\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~\miniconda3\envs\nlpspark\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/dkaenzig/cache_pretrained/lemma_antbnc_en_2.0.2_2.4_1556480454569/metadata
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
	at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1463)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1463)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
	at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:381)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:375)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:499)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

I found a similar thread here: #846, but none of the fixes there worked for me. In particular, I have verified that I have Java 8, that the environment variables are (to the best of my knowledge) set correctly, and I updated spark-nlp to the newest version.

I also followed the advice there, downloaded the model, and tried to load it manually, but to no avail. This leads to another error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-21-60e5ca4666d8> in <module>
----> 1 lemmatizer = LemmatizerModel().load("lemma_nl_2.5.0_2.4_1588532720582")\
      2     .setInputCols(['normalized']) \
      3     .setOutputCol('lemma')

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\ml\util.py in load(cls, path)
    330     def load(cls, path):
    331         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 332         return cls.read().load(path)
    333 
    334 

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\ml\util.py in load(self, path)
    280         if not isinstance(path, str):
    281             raise TypeError("path should be a string, got type %s" % type(path))
--> 282         java_obj = self._jread.load(path)
    283         if not hasattr(self._clazz, "_from_java"):
    284             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"

~\miniconda3\envs\nlpspark\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~\miniconda3\envs\nlpspark\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o149.load.
: java.lang.RuntimeException: Error while running command to get file permissions : ExitCodeException exitCode=-1073741701: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
	at org.apache.hadoop.util.Shell.run(Shell.java:482)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
	at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
	at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
	at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
	at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:49)
	at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1733)
	at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1713)
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:270)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
	at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1463)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1463)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
	at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

	at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:699)
	at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
	at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:49)
	at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1733)
	at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1713)
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:270)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
	at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1463)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1463)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
	at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

This error makes me think that the issue may be related to Hadoop and file permissions. Do you have any idea what the problem could be and how to fix it?

My Environment

  • Spark NLP version sparknlp.version(): '3.1.1'
  • Apache Spark version spark.version: '3.1.2'
  • Java version java -version: openjdk version "1.8.0_292"
    OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
    OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
  • Setup and installation (Pypi, Conda, Maven, etc.): miniconda, Python 3.7.9, spark-nlp 3.1.1, pyspark 3.1.1
  • Operating System and version: Windows 10, 20H2

Environment variables: (screenshots attached)

P.S. I tried the Spark build for Hadoop 3.2, but the problem was the same.

Thanks so much for your help in advance! Best wishes,

Diego

@maziyarpanahi
Member

Hi,

First of all, thank you for the detailed report; this really helps to narrow down the issue.

  • For the .pretrained() issue this is the actual error:
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/dkaenzig/cache_pretrained/lemma_antbnc_en_2.0.2_2.4_1556480454569/metadata

By default, Spark NLP downloads models/pipelines into the cache_pretrained folder in the user's home directory; here that is C:/Users/dkaenzig/cache_pretrained. For some reason, it either doesn't exist or doesn't have read/write/execute permissions. (Or something is missing in the Hadoop/Spark config on Windows, which I point you to at the end, but at this step you can check that the path exists and the permissions are OK.)
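
If it helps, here is a minimal sketch (plain Python, written against the default location described above) for checking that this cache folder exists and that your user can read and write it before calling .pretrained():

import os
from pathlib import Path

# Default download location mentioned above: <home>/cache_pretrained
cache_dir = Path.home() / "cache_pretrained"

print("exists:  ", cache_dir.exists())
print("readable:", os.access(cache_dir, os.R_OK))
print("writable:", os.access(cache_dir, os.W_OK))

# Creating it up front is harmless and rules out a missing folder
cache_dir.mkdir(parents=True, exist_ok=True)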

  • For loading the models offline, this is the error:
: java.lang.RuntimeException: Error while running command to get file permissions : ExitCodeException exitCode=-1073741701: 

It seems you put that model right in the root, and Spark doesn't have enough permissions to read and execute it. This can also be related to the configuration on Windows, but it would be best to keep the directory somewhere you definitely have sufficient permissions.
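
For illustration only (the folder name below is the extracted archive name seen earlier in this thread, and the path itself is hypothetical), loading an offline copy from a directory you fully control would look roughly like this:

from sparknlp.annotator import LemmatizerModel

# Hypothetical location: wherever you extracted the downloaded model archive,
# ideally somewhere under your own home directory rather than a shared or root path.
model_path = r"C:\Users\<your-user>\models\lemma_antbnc_en_2.0.2_2.4_1556480454569"

lemmatizer = LemmatizerModel.load(model_path) \
    .setInputCols(['normalized']) \
    .setOutputCol('lemma')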

Finally, it is really good that you have PySpark working correctly, but Spark NLP needs a couple more things:

  • First, please use the latest release. The code you posted pulls in a very old Spark NLP (spark-nlp_2.11:2.3.5), which is not compatible with PySpark 3.x at all! So please make sure you have installed spark-nlp==3.1.1 and start Spark NLP as follows (this takes care of everything, so there is no need for that SparkSession snippet in your code):

Please use this

spark = sparknlp.start()

And remove this part (if you prefer to construct your own SparkSession instead of sparknlp.start(), you can refer to our example: https://github.com/JohnSnowLabs/spark-nlp#pipconda):

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Spark NLP') \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.3.5") \
    .getOrCreate()
  • Please make sure you follow this guide for Windows 8/10. I know it's for PySpark 2.4.x, so you can ignore the parts regarding the versions, but the rest of the requirements are critical to have models downloaded and loaded correctly on Windows: How to correctly install Spark NLP on Windows 8 and 10 #1022
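
Putting this together, a minimal sketch of the recommended setup (assuming spark-nlp==3.1.1 and pyspark 3.1.x are installed in the environment; if you do build your own SparkSession, the matching coordinate for this release would be the Scala 2.12 artifact, com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.1, not the 2.11/2.3.5 one shown above):

import sparknlp
from sparknlp.annotator import LemmatizerModel

# start() builds a SparkSession with the matching Spark NLP jar already configured
spark = sparknlp.start()

print(sparknlp.version(), spark.version)

# With the versions aligned, the pretrained lemmatizer downloads into ~/cache_pretrained
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(['normalized']) \
    .setOutputCol('lemma')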

@dkaenzig
Author

dkaenzig commented Jul 5, 2021

Hi Maziyar,

Thanks so much for your prompt reply, this is very helpful!

I tried to follow your steps; in particular, I now use the spark = sparknlp.start() setup as you suggest. Furthermore, I completed some steps from the link you sent (#1022) that I had not done before. In particular, I:

  • Installed the Microsoft Visual C++ 2010 Redistributable Package (x64): https://www.microsoft.com/en-us/download/details.aspx?id=26999
  • Created the C:\tmp and C:\tmp\hive folders and fixed permissions by running %HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive and %HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/. I had tried that before, but back then it did not work (probably because I did not have the Microsoft Visual C++ 2010 Redistributable Package installed?). Now it runs silently, but nothing seems to happen in the C:\tmp folders. Is this correct?
  • Changed the spark/java/hadoop user variables to system variables as in your example (probably won't make a difference)

Regarding the permissions to read/write/execute, that is strange because I should be working in folders where I have all these permissions (by the way, C:/Users/dkaenzig/cache_pretrained exists). To double-check, I moved the folder from my Dropbox to my desktop, but unfortunately the problem remained. Is there a way to check the permissions from within spark-nlp?

After all these changes, I tried again, but unfortunately I still get an error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-9-5f238762f2fd> in <module>
     18 #lemmatizer = LemmatizerModel.load('lemma_nl_2.5.0_2.4_1588532720582')\
     19 #lemmatizer = LemmatizerModel().load("lemma_nl_2.5.0_2.4_1588532720582")\
---> 20 lemmatizer = LemmatizerModel.pretrained() \
     21     .setInputCols(['normalized']) \
     22     .setOutputCol('lemma') \

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\annotator.py in pretrained(name, lang, remote_loc)
    686     def pretrained(name="lemma_antbnc", lang="en", remote_loc=None):
    687         from sparknlp.pretrained import ResourceDownloader
--> 688         return ResourceDownloader.downloadModel(LemmatizerModel, name, lang, remote_loc)
    689 
    690 

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
     58             except Py4JJavaError as e:
     59                 sys.stdout.write("\n" + str(e))
---> 60                 raise e
     61             finally:
     62                 stop_threads = True

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
     55             t1.start()
     56             try:
---> 57                 j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
     58             except Py4JJavaError as e:
     59                 sys.stdout.write("\n" + str(e))

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\internal.py in __init__(self, reader, name, language, remote_loc, validator)
    189 class _DownloadModel(ExtendedJavaWrapper):
    190     def __init__(self, reader, name, language, remote_loc, validator):
--> 191         super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained."+validator+".downloadModel", reader, name, language, remote_loc)
    192 
    193 

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\internal.py in __init__(self, java_obj, *args)
    142         super(ExtendedJavaWrapper, self).__init__(java_obj)
    143         self.sc = SparkContext._active_spark_context
--> 144         self._java_obj = self.new_java_obj(java_obj, *args)
    145         self.java_obj = self._java_obj
    146 

~\miniconda3\envs\nlpspark\lib\site-packages\sparknlp\internal.py in new_java_obj(self, java_class, *args)
    152 
    153     def new_java_obj(self, java_class, *args):
--> 154         return self._new_java_obj(java_class, *args)
    155 
    156     def new_java_array(self, pylist, java_class):

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\ml\wrapper.py in _new_java_obj(java_class, *args)
     64             java_obj = getattr(java_obj, name)
     65         java_args = [_py2java(sc, arg) for arg in args]
---> 66         return java_obj(*java_args)
     67 
     68     @staticmethod

~\miniconda3\envs\nlpspark\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~\miniconda3\envs\nlpspark\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
	at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
	at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
	at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
	at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
	at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
	at org.apache.hadoop.fs.FileSystem$4.<init>(FileSystem.java:2072)
	at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2071)
	at org.apache.hadoop.fs.ChecksumFileSystem.listLocatedStatus(ChecksumFileSystem.java:700)
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
	at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1463)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1463)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
	at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:381)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:375)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:499)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Similarly, when trying to load the model locally:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-10-60e5ca4666d8> in <module>
----> 1 lemmatizer = LemmatizerModel().load("lemma_nl_2.5.0_2.4_1588532720582")\
      2     .setInputCols(['normalized']) \
      3     .setOutputCol('lemma')

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\ml\util.py in load(cls, path)
    330     def load(cls, path):
    331         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 332         return cls.read().load(path)
    333 
    334 

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\ml\util.py in load(self, path)
    280         if not isinstance(path, str):
    281             raise TypeError("path should be a string, got type %s" % type(path))
--> 282         java_obj = self._jread.load(path)
    283         if not hasattr(self._clazz, "_from_java"):
    284             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"

~\miniconda3\envs\nlpspark\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

~\miniconda3\envs\nlpspark\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~\miniconda3\envs\nlpspark\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o74.load.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
	at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
	at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
	at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
	at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
	at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
	at org.apache.hadoop.fs.FileSystem$4.<init>(FileSystem.java:2072)
	at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2071)
	at org.apache.hadoop.fs.ChecksumFileSystem.listLocatedStatus(ChecksumFileSystem.java:700)
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
	at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1463)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1463)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
	at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

The error seems to be different now, though; it looks related to loading a native library in Java? Do you have any idea what the problem could be?

Many thanks for your help!

Best,

Diego

@dkaenzig
Author

dkaenzig commented Jul 5, 2021

Update: I am not sure exactly what solved the problem, but:

  1. I also downloaded hadoop.dll as suggested here: https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io (see the quick check after this list)

  2. I set all the paths under System variables as well, not only under User variables

  3. I fixed permissions: %HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive
    %HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/

  4. And I restarted my machine (I had tried that before, so it was not just a matter of restarting)
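
For anyone hitting the same UnsatisfiedLinkError, here is a small sketch (paths assume HADOOP_HOME is set as discussed in this thread) for verifying that the native pieces from steps 1 and 3 are actually in place:

import os
from pathlib import Path

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME:", hadoop_home)

if hadoop_home:
    bin_dir = Path(hadoop_home) / "bin"
    # winutils.exe and hadoop.dll are the native helpers Hadoop needs on Windows
    for name in ("winutils.exe", "hadoop.dll"):
        print(name, "present:", (bin_dir / name).exists())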

And after all that, it seems to work now! It is a bit disappointing that I still do not know exactly what the problem was, but at least it's working.

Thanks so much again for all the help! Best,

Diego

@dkaenzig dkaenzig closed this as completed Jul 5, 2021
@maziyarpanahi
Member

I am glad it worked out, best of luck
