Scoring Code on Spark
Using Scoring Code on Spark is similar to using it without Spark. The main difference is that SparkScoringCodeModel is used instead of ScoringCodeModel.
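For comparison, a minimal non-Spark sketch; it assumes the non-Spark class is imported from datarobot_predict.scoring_code and that its predict() takes and returns a pandas DataFrame:

import pandas as pd
from datarobot_predict.scoring_code import ScoringCodeModel

# Non-Spark scoring: pandas DataFrame in, pandas DataFrame out.
model = ScoringCodeModel("model.jar")
result = model.predict(pd.read_csv("input.csv"))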
Loading from filesystem
Loading a model jar from the filesystem makes it easy to use Scoring Code on Spark. Models can be loaded dynamically without having to add jars to Spark's classpath or restart Spark. A model is loaded from the filesystem by specifying the path to the model jar:
from datarobot_predict.spark_scoring_code import SparkScoringCodeModel
model = SparkScoringCodeModel("model.jar")
To avoid undefined behavior from multiple versions of the Scoring Code Java classes, it is recommended not to add any Scoring Code model jars or Scoring Code API/Spark API jars to the classpath when models are loaded from the filesystem. The library will try to enforce this by not allowing a SparkScoringCodeModel to be instantiated if it detects the Scoring Code API in the classpath.
This behavior can be turned off using the allow_models_in_classpath constructor parameter:
model = SparkScoringCodeModel("model.jar", allow_models_in_classpath=True)
Loading from classpath
If required, a model can be loaded from the classpath by not specifying a jar path:
model = SparkScoringCodeModel()
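This assumes the model jar is already on the classpath when the Spark job starts; a minimal sketch, assuming the job was submitted with the jar attached (file names are placeholders):

# Submitted with the model jar on the classpath, e.g.:
#   spark-submit --jars model.jar score.py
from datarobot_predict.spark_scoring_code import SparkScoringCodeModel

model = SparkScoringCodeModel()  # no path: the model is resolved from the classpath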
It is recommended not to work with models from the filesystem and models in the classpath at the same time, for the reasons mentioned in the previous section.
Scoring
Models can score Spark DataFrames:
spark_df = spark.read.csv("input.csv", header=True)  # header row supplies the feature names
result = model.predict(spark_df)
result.show()
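The result is an ordinary Spark DataFrame, so it can be processed or persisted with the usual DataFrame API; for example (the output path is a placeholder):

result.write.mode("overwrite").parquet("predictions")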
It is also possible to score a pandas DataFrame. The result is returned as a Spark DataFrame.
import pandas as pd

pandas_df = pd.read_csv("input.csv")
spark_result = model.predict(pandas_df)
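If a pandas result is needed, the returned Spark DataFrame can be converted back with the standard toPandas() method:

pandas_result = spark_result.toPandas()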