SparkIngestModel
- class datarobotx.SparkIngestModel(base_model, dataset_name=None, sampling_strategy='uniform')
Train on a Spark dataframe
Ingests a Spark dataframe into DataRobot for model training, downsampling if needed.
An AI Catalog entry is created automatically for the ingested data, and Autopilot is then orchestrated as normal.
- Parameters:
base_model (AutopilotModel or IntraProjectModel) – Base model for orchestrating Autopilot after the data has been ingested. Clustering and AutoTS are not supported.
dataset_name (str) – Name for the automatically-created AI Catalog entry containing the ingested data from Spark
sampling_strategy ({'uniform', 'most_recent', 'smart', 'smart_zero_inflated'}, default='uniform') –
Downsampling strategy used when sampling is needed to meet the ingest limit. With smart sampling, training weights are calculated, stored in the column ‘dr_sample_weights’, and used automatically at fit-time.
‘smart’ sampling requires a target variable to be passed at fit-time, and ‘most_recent’ sampling requires a datetime_partition_column at fit-time.
Notes
‘uniform’ samples uniformly at random from the provided dataframe.
‘most_recent’ samples after ordering the data by the ‘datetime_partition_column’.
‘smart’ samples while attempting to preserve as many minority-target examples as possible.
‘smart_zero_inflated’ performs smart sampling, but treats all non-zero target values as the same class.
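The idea behind ‘smart’ and ‘smart_zero_inflated’ sampling can be illustrated with a small pure-Python sketch. This is not DataRobot's implementation: the function name `smart_sample`, the even split of the row budget across classes, and the inverse-sampling-rate weights are all assumptions chosen for illustration. Only the weight column name ‘dr_sample_weights’ comes from the documentation above.

```python
import random
from collections import defaultdict


def smart_sample(rows, target, limit, seed=0, zero_inflated=False):
    """Downsample `rows` (a list of dicts) to at most `limit` rows.

    Hypothetical sketch: rare classes are kept first, larger classes are
    down-sampled, and each kept row gets an inverse-sampling-rate weight.
    """
    if len(rows) <= limit:
        # No sampling needed, so every row keeps a weight of 1
        return [dict(row, dr_sample_weights=1.0) for row in rows]

    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        value = row[target]
        # Zero-inflated mode: every non-zero target counts as one class
        key = (value != 0) if zero_inflated else value
        groups[key].append(row)

    sampled = []
    remaining = limit
    # Visit classes from rarest to most common so minority rows are kept first
    ordered = sorted(groups.values(), key=len)
    for i, members in enumerate(ordered):
        # Split the remaining budget evenly among the classes not yet visited
        quota = remaining // (len(ordered) - i)
        take = min(len(members), quota)
        weight = len(members) / take if take else 0.0
        for row in rng.sample(members, take):
            sampled.append(dict(row, dr_sample_weights=weight))
        remaining -= take
    return sampled
```

With 95 negative and 5 positive rows and a limit of 20, this sketch keeps all 5 positives at weight 1 and 15 negatives at weight 95/15, so the weights still sum to the original row count.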
Inherited attributes:
base_model – Base model used for fitting
dr_model – DataRobot python client datarobot.Model object for the present champion
dr_project – DataRobot python client datarobot.Project object
Methods:
fit(X, *args, **kwargs) – Fit model from a Spark dataframe
Inherited methods:
deploy([wait_for_autopilot, name]) – Deploy the model into ML Ops
get_params() – Configuration parameters for the intra-project model
predict(X[, wait_for_autopilot]) – Make batch predictions using the present champion
predict_proba(X[, wait_for_autopilot]) – Calculate class probabilities using the present champion
set_params(**kwargs) – Set configuration parameters for the intra-project model
share(emails) – Share a project with other users.
- property base_model: ModelOperator
Base model used for fitting
- Returns:
Base model instance
- Return type:
AutopilotModel or IntraProjectModel
- deploy(wait_for_autopilot=False, name=None)
Deploy the model into ML Ops
- Parameters:
wait_for_autopilot (bool, optional, default=False) – If True, wait for Autopilot to complete before deploying the model. In non-notebook environments, fit() will always block until complete.
name (str, optional, default=None) – Name for the deployment. If None, a name will be generated.
- Returns:
Resulting ML Ops deployment
- Return type:
Deployment
- property dr_model: datarobot.Model
DataRobot python client datarobot.Model object for the present champion
- Returns:
datarobot.Model object associated with this drx model
- Return type:
datarobot.Model
- property dr_project: datarobot.Project
DataRobot python client datarobot.Project object
- Returns:
datarobot.Project object associated with this drx model
- Return type:
datarobot.Project
- fit(X, *args, **kwargs)
Fit model from a Spark dataframe
- Parameters:
X (pyspark.sql.DataFrame) – Training dataset to be ingested
*args – Positional arguments to be passed to the base model fit()
**kwargs – Keyword arguments to be passed to the base model fit()
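A minimal usage sketch of the constructor and fit() described above. The helper name `train_from_spark` is hypothetical, and passing the target as the keyword `target` is an assumption about the base model's fit() signature. The example also assumes `drx.AutopilotModel()` can be constructed with no arguments.

```python
def train_from_spark(spark_df, target, dataset_name=None, sampling_strategy="smart"):
    """Ingest a Spark dataframe and start Autopilot on it (illustrative helper)."""
    # Imported inside the function so the helper can be defined
    # even where datarobotx is not installed
    import datarobotx as drx

    base_model = drx.AutopilotModel()
    model = drx.SparkIngestModel(
        base_model,
        dataset_name=dataset_name,
        sampling_strategy=sampling_strategy,
    )
    # Extra arguments are forwarded to the base model's fit();
    # the keyword name `target` is an assumption, not confirmed by this page
    model.fit(spark_df, target=target)
    return model
```

Because ‘smart’ sampling needs a target at fit-time, the sketch defaults to it only when a target is supplied. For ‘most_recent’, a datetime_partition_column would need to be forwarded to fit() in the same way.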
- get_params()
Configuration parameters for the intra-project model
- Returns:
config – Configuration object containing the parameters for the intra-project model
- Return type:
Notes
Access configuration parameters for the underlying base model by calling get_params() on the base_model attribute.
- predict(X, wait_for_autopilot=False)
Make batch predictions using the present champion
Predictions are calculated asynchronously: the call returns immediately, and the returned DataFrame is populated with data once predictions are complete.
Predictions are made within the project containing the model using modeling workers. For real-time predictions, first deploy the model.
- Parameters:
X (pandas.DataFrame) – Dataset to be scored - target column can be included or omitted
wait_for_autopilot (bool, optional, default=False) – If True, wait for Autopilot to complete before making predictions. In non-notebook environments, fit() will always block until complete.
- Returns:
Resulting predictions (contained in the column ‘predictions’). Returned immediately and updated automatically when results are complete.
- Return type:
FutureDataFrame
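Note that fit() takes a pyspark.sql.DataFrame while predict() takes a pandas.DataFrame. The sketch below shows one way to bridge the two for a small scoring set. The helper name `predict_on_spark_sample` and the `max_rows` cap are hypothetical; `limit()` and `toPandas()` are standard PySpark DataFrame methods.

```python
def predict_on_spark_sample(model, spark_df, max_rows=1000):
    """Score a bounded slice of a Spark dataframe with a fitted drx model."""
    # predict() expects pandas, so collect at most `max_rows` rows to the driver
    pandas_df = spark_df.limit(max_rows).toPandas()
    # Wait for Autopilot so that the champion model exists before scoring
    return model.predict(pandas_df, wait_for_autopilot=True)
```

The cap keeps the collected data small enough for driver memory. Larger scoring jobs would likely need a different approach, such as scoring through a deployment.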
- predict_proba(X, wait_for_autopilot=False)
Calculate class probabilities using the present champion
Only available for classifier and clustering models.
- Parameters:
X (pandas.DataFrame) – Dataset to compute class probabilities on; target column can be included or omitted
wait_for_autopilot (bool, optional, default=False) – If True, wait for Autopilot to complete before making predictions. In non-notebook environments, fit() will always block until complete.
- Returns:
Resulting predictions. Probabilities for each label are contained in the column ‘class_{label}’. Returned immediately and updated automatically when results are complete.
- Return type:
FutureDataFrame
- set_params(**kwargs)
Set configuration parameters for the intra-project model
- Parameters:
**kwargs – Configuration parameters to be set or updated for this model.
- Returns:
self – IntraProjectModel instance
- Return type:
IntraProjectModel
Notes
Configuration parameters for the underlying base model can be set by calling set_params() on the base_model attribute.
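The split between the wrapper's parameters and the base model's parameters can be sketched as follows. The helper name `configure` is hypothetical, and no specific parameter names are assumed: callers pass whatever keys their drx version accepts.

```python
def configure(model, wrapper_params=None, base_params=None):
    """Apply settings to a drx model and, separately, to its base model."""
    if wrapper_params:
        # Parameters for the wrapping model itself
        model.set_params(**wrapper_params)
    if base_params:
        # Parameters for the underlying base model go through base_model
        model.base_model.set_params(**base_params)
    return model
```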