Accessors API¶
df.ml. Methods¶
apply_df_train_transform¶
apply_df_train_transform(ml_transform: MLTransform) -> pd.Series [source]
Parameters
ml_transform: pandas_toolkit.ml.MLTransform object containing the transform to apply and the column name to apply it to (normally obtained via e.g. df_train.ml.transforms["standard_scaler"]).
Returns
Transformed feature, computed using e.g. the df_train data statistics.
Examples
>>> df_train = pd.DataFrame({"x": [0, 1],
...                          "y": [0, 1]},
...                         index=[0, 1])
>>> df_validation = pd.DataFrame({"x": [2],
...                               "y": [2]},
...                              index=[0])
>>> df_train["standard_scaler_x"] = df_train.ml.standard_scaler(column="x")
>>> df_train["standard_scaler_x"]
pd.Series([-1, 1])
>>> df_train.ml.transforms
{'standard_scaler': <pandas_toolkit.ml.MLTransform object at 0x7f1af20f0af0>}
>>> df_validation["standard_scaler_x"] = \
df_validation.ml.apply_df_train_transform(df_train.ml.transforms["standard_scaler"])
>>> df_validation["standard_scaler_x"]
pd.Series([3])
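For intuition, this is equivalent to fitting a scikit-learn StandardScaler on the training column and reusing it on the validation data; a minimal sketch in plain scikit-learn (not part of pandas_toolkit):
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(df_train[["x"]])  # learns u=0.5, s=0.5 from df_train
>>> scaler.transform(df_validation[["x"]])          # (2 - 0.5) / 0.5 = 3
array([[3.]])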
standard_scaler¶
standard_scaler(column: str) -> pd.Series [source]
Parameters
column: Column denoting feature to standardize.
Returns
Standardized feature, obtained by removing the mean and scaling to unit variance (via scikit-learn):
z = (x - u) / s
(u := mean of the training samples, s := standard deviation of the training samples).
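As a worked instance with the training data x = [0, 1] from the example below: u = 0.5 and s = 0.5 (scikit-learn's StandardScaler uses the population standard deviation), so z = [(0 - 0.5)/0.5, (1 - 0.5)/0.5] = [-1, 1], matching the pd.Series([-1, 1]) output.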
Side Effects
Updates the df.ml.transforms dictionary with key "standard_scaler" and a pandas_toolkit.ml.MLTransform value holding the column name and the fitted sklearn.preprocessing.StandardScaler object.
Examples
>>> df = pd.DataFrame({"x": [0, 1],
...                    "y": [0, 1]},
...                   index=[0, 1])
>>> df["standard_scaler_x"] = df.ml.standard_scaler(column="x")
>>> df["standard_scaler_x"]
pd.Series([-1, 1])
>>> df.ml.transforms
{'standard_scaler': <pandas_toolkit.ml.MLTransform object at 0x7f1af20f0af0>}
train_validation_split¶
train_validation_split(train_frac: float, random_seed: int = None) -> Tuple[pd.DataFrame, pd.DataFrame] [source]
Parameters
train_frac: Fraction of rows to be added to df_train.
random_seed: Seed for the random number generator (e.g. for reproducible splits).
Returns
df_train and df_validation, split from the original dataframe.
Examples
>>> df = pd.DataFrame({"x": [0, 1, 2],
...                    "y": [0, 1, 2]},
...                   index=[0, 1, 2])
>>> df_train, df_validation = df.ml.train_validation_split(train_frac=2/3)
>>> df_train
pd.DataFrame({"x": [0, 1], "y": [0, 1]}, index=[0, 1]),
>>> df_validation
pd.DataFrame({"x": [2], "y": [2]}, index=[2])
df.nn. Methods¶
init¶
init(x_columns: List[str], y_columns: List[str], net_function: Callable[[jnp.ndarray], jnp.ndarray], loss: str, optimizer: InitUpdate = optix.adam(learning_rate=1e-3), batch_size: int = None, apply_rng: bool = False, rng_seed: int = random.randint(0, int(1e15))) -> pd.DataFrame [source]
Parameters
x_columns: Columns to be used as input for the model.
y_columns: Columns to be used as output for the model.
net_function: A function that defines a dm-haiku neural network and how to use it to predict (this function is passed to hk.transform). It should have the signature net_function(x: jnp.ndarray) -> jnp.ndarray.
loss: Loss function to use. See available loss functions in jax_toolkit.
optimizer: Optimizer to use. See available optimizers in jax.
batch_size: Batch size to use. If not specified, the number of rows in the entire dataframe is used.
apply_rng: If your net_function is non-deterministic, set this value to True to enable your model to predict with randomness.
rng_seed: Set a seed for reproducibility.
Returns
df containing a neural network model ready for training with pandas_toolkit.
Examples
>>> import haiku as hk
>>> import jax.numpy as jnp
>>> from jax.nn import relu
>>> def net_function(x: jnp.ndarray) -> jnp.ndarray:
...     net = hk.Sequential([relu])
...     predictions: jnp.ndarray = net(x)
...     return predictions
>>> df_train = df_train.nn.init(x_columns=["x"],
... y_columns=["y"],
... net_function=net_function,
... loss="mean_squared_error")
>>> for _ in range(10): # num_epochs
... df_train = df_train.nn.update(df_validation_to_plot=df_validation)
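The defaults can be overridden via the documented parameters; a hedged sketch (assuming optix.sgd is available alongside the optix.adam default shown in the signature, with illustrative hyperparameter values):
>>> from jax.experimental import optix
>>> df_train = df_train.nn.init(x_columns=["x"],
...                             y_columns=["y"],
...                             net_function=net_function,
...                             loss="mean_squared_error",
...                             optimizer=optix.sgd(learning_rate=1e-2),
...                             batch_size=32,
...                             rng_seed=42)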
get_model¶
get_model() -> Model [source]
Returns
A pandas_toolkit.nn.Model object. As this is not linked to a pd.DataFrame, it is much more lightweight and could be used in e.g. a production setting.
Examples
>>> model = df_train.nn.get_model()
>>> model.predict(x=jnp.array([42]))
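Because the Model is decoupled from the pd.DataFrame, it can be shipped to production on its own; a sketch assuming the object is picklable (not a documented guarantee):
>>> import pickle
>>> with open("model.pkl", "wb") as f:
...     pickle.dump(model, f)  # persist the lightweight model
>>> with open("model.pkl", "rb") as f:
...     model = pickle.load(f)
>>> model.predict(x=jnp.array([42]))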
hvplot_losses¶
hvplot_losses() -> None [source]
Returns
A Holoviews object for interactive (via Bokeh), real-time plotting of training and validation loss curves. For an example usage, see this notebook.
Examples
>>> df_train.nn.hvplot_losses()
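A typical workflow is to render the plot once and then train, assuming the curves update live as each epoch completes (per the real-time description above):
>>> df_train.nn.hvplot_losses()  # displays the live loss plot
>>> for _ in range(10):  # num_epochs; the plot refreshes as training progresses
...     df_train = df_train.nn.update(df_validation_to_plot=df_validation)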
update¶
update(df_validation_to_plot: pd.DataFrame = None) -> pd.DataFrame [source]
Parameters
df_validation_to_plot: Validation data to evaluate and update loss curve with.
Returns
df containing an updated neural network model (trained for one extra epoch).
Examples
>>> for _ in range(10): # num_epochs
... df_train = df_train.nn.update(df_validation_to_plot=df_validation)
predict¶
predict(x_columns: List[str] = None, batch_size: int = None) -> jnp.ndarray [source]
Parameters
x_columns: Columns to predict on. If None, the same x_columns names used to train the model are used.
batch_size: Batch size to use. If not specified, the entire dataset is used.
Returns
Predictions as a jnp.ndarray.
Examples
>>> df_new = pd.DataFrame({"x": [-10, -5, 22]})
>>> df_new.model = df_train.nn.get_model()
>>> df_new["predictions"] = df_new.nn.predict()
evaluate¶
evaluate(x_columns: List[str] = None, y_columns: List[str] = None, batch_size: int = None) -> jnp.ndarray [source]
Parameters
x_columns: Columns to predict on. If None, the same x_columns names used to train the model are used.
y_columns: Columns with true output values to compare predicted values with. If None, the same y_columns names used to train the model are used.
batch_size: Batch size to use. If not specified, the entire dataset is used.
Returns
Evaluation of the predictions using the loss provided in df.nn.init(...).
Examples
>>> df_test = pd.DataFrame({"x": [-1, 0, 1], "y": [0, 0, 1]})
>>> df_test.model = df_train.nn.get_model()
>>> df_test.nn.evaluate()
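For intuition, with loss="mean_squared_error" the returned value should match computing the loss by hand; a sketch assuming predictions come back with shape (n_rows, 1):
>>> y_pred = df_test.nn.predict()
>>> y_true = jnp.array(df_test[["y"]].values)
>>> jnp.mean((y_pred - y_true) ** 2)  # should equal df_test.nn.evaluate()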