Scoring Code on Spark
Using Scoring Code on Spark is similar to using it without Spark. The main difference is that SparkScoringCodeModel is used instead of ScoringCodeModel.
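For comparison, a minimal non-Spark sketch; it assumes the non-Spark class is imported from datarobot_predict.scoring_code and that its predict() takes and returns a pandas DataFrame:

import pandas as pd
from datarobot_predict.scoring_code import ScoringCodeModel

# Non-Spark scoring: pandas DataFrame in, pandas DataFrame out.
model = ScoringCodeModel("model.jar")
result = model.predict(pd.read_csv("input.csv"))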
Loading from filesystem
Loading a model jar from the filesystem makes it easy to use Scoring Code on Spark. Models can be loaded dynamically without having to add jars to Spark's classpath or restart Spark. A model is loaded from the filesystem by specifying the path to the model jar:
from datarobot_predict.spark_scoring_code import SparkScoringCodeModel
model = SparkScoringCodeModel("model.jar")
To avoid undefined behavior from multiple versions of the Scoring Code Java classes, it is recommended not to add any Scoring Code model jars or Scoring Code API/Spark API jars to the classpath when models are loaded from the filesystem. The library will try to enforce this by not allowing a SparkScoringCodeModel to be instantiated if it detects the Scoring Code API in the classpath.
This behavior can be turned off using the allow_models_in_classpath constructor parameter:
model = SparkScoringCodeModel("model.jar", allow_models_in_classpath=True)
Loading from classpath
If required, a model can be loaded from the classpath by not specifying a jar path:
model = SparkScoringCodeModel()
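This assumes the model jar is already on the classpath when the Spark job starts; a minimal sketch, assuming the job was submitted with the jar attached (file names are placeholders):

# Submitted with the model jar on the classpath, e.g.:
#   spark-submit --jars model.jar score.py
from datarobot_predict.spark_scoring_code import SparkScoringCodeModel

model = SparkScoringCodeModel()  # no path: the model is resolved from the classpath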
It is recommended not to work with models from the filesystem and models in the classpath at the same time, for the reasons mentioned in the previous section.
Scoring
Models can score Spark DataFrames:
spark_df = spark.read.csv("input.csv", header=True)  # header row supplies the feature names
result = model.predict(spark_df)
result.show()
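The result is an ordinary Spark DataFrame, so it can be processed or persisted with the usual DataFrame API; for example (the output path is a placeholder):

result.write.mode("overwrite").parquet("predictions")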
It is also possible to score a pandas DataFrame. The result is returned as a Spark DataFrame.
import pandas as pd

pandas_df = pd.read_csv("input.csv")
spark_result = model.predict(pandas_df)
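If a pandas result is needed, the returned Spark DataFrame can be converted back with the standard toPandas() method:

pandas_result = spark_result.toPandas()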