SparkIngestModel

class datarobotx.SparkIngestModel(base_model, dataset_name=None, sampling_strategy='uniform')

Train on a Spark dataframe

Ingests a Spark dataframe into DataRobot for model training, downsampling if needed.

An AI Catalog entry is created automatically for the ingested data, and Autopilot is then orchestrated as normal.

Parameters:

base_model (AutopilotModel or IntraProjectModel) – Base model for orchestrating Autopilot after the Spark data has been ingested. Clustering and AutoTS are not supported.
dataset_name (str) – Name for the automatically-created AI Catalog entry containing the ingested data from Spark
sampling_strategy ({'uniform', 'most_recent', 'smart', 'smart_zero_inflated'}, default='uniform') –
Downsampling strategy to use if sampling is needed to meet the ingest limit. When using smart sampling, training weights are calculated, stored in the column ‘dr_sample_weights’, and automatically used at fit-time.

‘smart’ sampling requires a target variable to be passed at fit-time and ‘most_recent’ sampling requires a datetime_partition_column at fit-time.
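The fit-time requirements above can be checked before a long-running ingest starts. The following is a minimal sketch, not part of datarobotx: the helper name `missing_fit_kwargs` is hypothetical, the keyword name `target` is an assumption about how the base model's fit() receives the target, and treating ‘smart_zero_inflated’ like ‘smart’ is also an assumption.

```python
# Fit-time keyword arguments each sampling strategy needs, per the text above.
REQUIRED_FIT_KWARGS = {
    "uniform": (),
    "most_recent": ("datetime_partition_column",),
    "smart": ("target",),  # assumption: the target is passed as `target=`
    # Assumption: this strategy "performs smart sampling", so it likely
    # needs a target as well; the documentation does not state this.
    "smart_zero_inflated": ("target",),
}


def missing_fit_kwargs(sampling_strategy, fit_kwargs):
    """Return the required fit() keyword arguments that are absent or None."""
    if sampling_strategy not in REQUIRED_FIT_KWARGS:
        raise ValueError(f"Unknown sampling strategy: {sampling_strategy!r}")
    required = REQUIRED_FIT_KWARGS[sampling_strategy]
    return [name for name in required if fit_kwargs.get(name) is None]
```

For example, `missing_fit_kwargs("smart", {})` reports that `target` is missing, while ‘uniform’ needs nothing extra.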

Notes

‘uniform’ samples uniformly at random from the provided dataframe

‘most_recent’ samples after ordering the data by the ‘datetime_partition_column’

‘smart’ samples attempting to preserve as many minority target examples as possible

‘smart_zero_inflated’ performs smart sampling, but treats all non-zero values as the same class
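To make the four strategies concrete, here is an illustrative pure-Python sketch that operates on a list of dict rows. It is not DataRobot's implementation: the function names are hypothetical, the way weights are computed (original class count divided by kept class count) is an assumption, and only the column name ‘dr_sample_weights’ comes from the documentation above.

```python
import random
from collections import Counter

WEIGHT_COLUMN = "dr_sample_weights"  # column name taken from the documentation


def uniform_sample(rows, limit, seed=0):
    """Sample rows uniformly at random, without replacement."""
    if len(rows) <= limit:
        return list(rows)
    return random.Random(seed).sample(rows, limit)


def most_recent_sample(rows, limit, datetime_partition_column):
    """Order by the datetime column and keep the latest `limit` rows (assumed)."""
    ordered = sorted(rows, key=lambda r: r[datetime_partition_column])
    return ordered[max(len(ordered) - limit, 0):]


def smart_sample(rows, limit, target, zero_inflated=False, seed=0):
    """Keep minority-class rows first, then fill up with majority-class rows."""
    if not rows:
        return []

    def label(row):
        # 'smart_zero_inflated': all non-zero values count as one class.
        if zero_inflated:
            return "zero" if row[target] == 0 else "non_zero"
        return row[target]

    rng = random.Random(seed)
    majority = Counter(label(r) for r in rows).most_common(1)[0][0]
    minority_rows = [r for r in rows if label(r) != majority]
    majority_rows = [r for r in rows if label(r) == majority]

    if len(rows) <= limit:
        keep_minor, keep_major = minority_rows, majority_rows
    else:
        if len(minority_rows) <= limit:
            keep_minor = minority_rows
        else:
            keep_minor = rng.sample(minority_rows, limit)
        keep_major = rng.sample(majority_rows, limit - len(keep_minor))

    sampled = []
    for original, kept in ((minority_rows, keep_minor), (majority_rows, keep_major)):
        if kept:
            # Assumed weighting: each kept row stands in for this many originals.
            weight = len(original) / len(kept)
            sampled.extend({**row, WEIGHT_COLUMN: weight} for row in kept)
    return sampled
```

With 90 majority rows, 10 minority rows, and a limit of 20, this sketch keeps all 10 minority rows and weights each sampled majority row by 9, so the weights still sum to the original row count.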

Inherited attributes:

`base_model`	Base model used for fitting
`dr_model`	DataRobot python client datarobot.Model object for the present champion
`dr_project`	DataRobot python client datarobot.Project object

Methods:

fit(X, *args, **kwargs)

Fit model from a Spark dataframe

Inherited methods:

`deploy`([wait_for_autopilot, name])	Deploy the model into ML Ops
`get_params`()	Configuration parameters for the intra-project model
`predict`(X[, wait_for_autopilot])	Make batch predictions using the present champion
`predict_proba`(X[, wait_for_autopilot])	Calculate class probabilities using the present champion
`set_params`(**kwargs)	Set configuration parameters for the intra-project model
`share`(emails)	Share a project with other users.

property base_model: ModelOperator

Base model used for fitting

Returns: Base model instance
Return type: AutopilotModel or IntraProjectModel

deploy(wait_for_autopilot=False, name=None)

Deploy the model into ML Ops

Parameters:

wait_for_autopilot (bool, optional, default=False) – If True, wait for Autopilot to complete before deploying the model. In non-notebook environments, fit() will always block until complete.
name (str, optional, default=None) – Name for the deployment. If None, a name will be generated.

Returns:

Deployment – Resulting ML Ops deployment

Return type:

Deployment

property dr_model: datarobot.Model

DataRobot python client datarobot.Model object for the present champion

Returns: datarobot.Model object associated with this drx model
Return type: datarobot.Model

property dr_project: datarobot.Project

DataRobot python client datarobot.Project object

Returns: datarobot.Project object associated with this drx model
Return type: datarobot.Project

fit(X, *args, **kwargs)

Fit model from a Spark dataframe

Parameters:

X (pyspark.sql.DataFrame) – Training dataset to be ingested
*args – Positional arguments to be passed to the base model fit()
**kwargs – Keyword arguments to be passed to the base model fit()
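A hedged end-to-end sketch of constructing the model and fitting it from a Spark dataframe follows. It assumes datarobotx is installed and DataRobot credentials are already configured, that `AutopilotModel` can be constructed without arguments, and that the base model's fit() accepts the target as `target=`. The function name `train_from_spark` and the catalog entry name are hypothetical.

```python
def train_from_spark(spark_df, target_column):
    """Usage sketch; nothing contacts DataRobot until this function is called."""
    # Imported lazily so the snippet loads even where datarobotx is absent.
    import datarobotx as drx

    # Base model type named in the constructor's parameter list.
    base = drx.AutopilotModel()
    model = drx.SparkIngestModel(
        base,
        dataset_name="spark_training_data",  # hypothetical AI Catalog entry name
        sampling_strategy="smart",  # 'smart' needs a target at fit-time
    )
    # Assumption: extra keyword arguments are forwarded to the base model's
    # fit(), which receives the target column name as `target=`.
    model.fit(spark_df, target=target_column)
    return model
```

Here `spark_df` would be a pyspark.sql.DataFrame, as documented for `X` above.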

get_params()

Configuration parameters for the intra-project model

Returns: config – Configuration object containing the parameters for the intra-project model
Return type: dict

Notes

Access configuration parameters for the underlying base model by calling get_params() on the base_model attribute

predict(X, wait_for_autopilot=False)

Make batch predictions using the present champion

Predictions are calculated asynchronously: the call returns immediately, and the returned DataFrame is populated with data once predictions are completed.

Predictions are made within the project containing the model using modeling workers. For real-time predictions, first deploy the model.

Parameters:

X (pandas.DataFrame) – Dataset to be scored - target column can be included or omitted
wait_for_autopilot (bool, optional, default=False) – If True, wait for Autopilot to complete before making predictions. In non-notebook environments, fit() will always block until complete.

Returns:

Resulting predictions (contained in the column ‘predictions’). Returned immediately and updated automatically when results are completed.

Return type:

FutureDataFrame

predict_proba(X, wait_for_autopilot=False)

Calculate class probabilities using the present champion

Only available for classifier and clustering models.

Parameters:

X (pandas.DataFrame) – Dataset to compute class probabilities on; target column can be included or omitted
wait_for_autopilot (bool, optional, default=False) – If True, wait for Autopilot to complete before making predictions. In non-notebook environments, fit() will always block until complete.

Returns:

Resulting predictions; probabilities for each label are contained in the column ‘class_{label}’; returned immediately, updated automatically when results are completed.

Return type:

FutureDataFrame
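As an illustration of the ‘class_{label}’ column naming, the sketch below picks the most likely label per row from completed results. It works on plain dict records and assumes the results have already been materialized and converted (for instance with pandas `DataFrame.to_dict("records")`, if the returned object behaves like a pandas DataFrame). The helper name `most_likely_labels` is hypothetical.

```python
def most_likely_labels(proba_rows, prefix="class_"):
    """Return the label with the highest probability for each record."""
    labels = []
    for row in proba_rows:
        # Collect only the probability columns and strip the 'class_' prefix.
        probs = {
            column[len(prefix):]: value
            for column, value in row.items()
            if column.startswith(prefix)
        }
        # max() raises ValueError if a record has no probability columns.
        labels.append(max(probs, key=probs.get))
    return labels
```

Columns that do not start with the prefix are ignored, so extra columns in the records do not affect the result.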