Accessors API¶
df.ml. Methods¶
apply_df_train_transform¶
apply_df_train_transform(ml_transform: MLTransform) -> pd.Series [source]
Parameters
ml_transform: pandas_toolkit.ml.MLTransform object containing the transform to apply and the column name to apply it to (normally obtained via e.g. df_train.ml.transforms["standard_scaler"]).
Returns
Transformed feature, computed using e.g. the df_train data statistics.
Examples
>>> df_train = pd.DataFrame({"x": [0, 1],
...                          "y": [0, 1]},
...                         index=[0, 1])
>>> df_validation = pd.DataFrame({"x": [2],
...                               "y": [2]},
...                              index=[0])
>>> df_train["standard_scaler_x"] = df_train.ml.standard_scaler(column="x")
>>> df_train["standard_scaler_x"]
pd.Series([-1, 1])
>>> df_train.ml.transforms
{'standard_scaler': <pandas_toolkit.ml.MLTransform object at 0x7f1af20f0af0>}
>>> df_validation["standard_scaler_x"] = \
df_validation.ml.apply_df_train_transform(df_train.ml.transforms["standard_scaler"])
>>> df_validation["standard_scaler_x"]
pd.Series([3])
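For intuition, this is equivalent to fitting a scikit-learn StandardScaler on the training column and reusing it on the validation data; a minimal sketch in plain scikit-learn (not part of pandas_toolkit):
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(df_train[["x"]])  # learns u=0.5, s=0.5 from df_train
>>> scaler.transform(df_validation[["x"]])          # (2 - 0.5) / 0.5 = 3
array([[3.]])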
standard_scaler¶
standard_scaler(column: str) -> pd.Series [source]
Parameters
column: Column denoting feature to standardize.
Returns
Standardized feature, obtained by removing the mean and scaling to unit variance (via scikit-learn):
z = (x - u) / s
(u := mean of the training samples, s := standard deviation of the training samples).
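As a worked instance with the training data x = [0, 1] from the example below: u = 0.5 and s = 0.5 (scikit-learn's StandardScaler uses the population standard deviation), so z = [(0 - 0.5)/0.5, (1 - 0.5)/0.5] = [-1, 1], matching the pd.Series([-1, 1]) output.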
Side Effects
Updates the df.ml.transforms dictionary with key "standard_scaler" and a pandas_toolkit.ml.MLTransform value holding the column name and the fitted sklearn.preprocessing.StandardScaler object.
Examples
>>> df = pd.DataFrame({"x": [0, 1],
...                    "y": [0, 1]},
...                   index=[0, 1])
>>> df["standard_scaler_x"] = df.ml.standard_scaler(column="x")
>>> df["standard_scaler_x"]
pd.Series([-1, 1])
>>> df.ml.transforms
{'standard_scaler': <pandas_toolkit.ml.MLTransform object at 0x7f1af20f0af0>}
train_validation_split¶
train_validation_split(train_frac: float, random_seed: int = None) -> Tuple[pd.DataFrame, pd.DataFrame] [source]
Parameters
train_frac: Fraction of rows to be added to df_train.
random_seed: Seed for the random number generator (e.g. for reproducible splits).
Returns
df_train and df_validation, split from the original dataframe.
Examples
>>> df = pd.DataFrame({"x": [0, 1, 2],
...                    "y": [0, 1, 2]},
...                   index=[0, 1, 2])
>>> df_train, df_validation = df.ml.train_validation_split(train_frac=2/3)
>>> df_train
pd.DataFrame({"x": [0, 1], "y": [0, 1]}, index=[0, 1]),
>>> df_validation
pd.DataFrame({"x": [2], "y": [2]}, index=[2])
df.nn. Methods¶
init¶
init(x_columns: List[str], y_columns: List[str], net_function: Callable[[jnp.ndarray], jnp.ndarray], loss: str, optimizer: InitUpdate = optix.adam(learning_rate=1e-3), batch_size: int = None, apply_rng: bool = False, rng_seed: int = random.randint(0, int(1e15))) -> pd.DataFrame [source]
Parameters
x_columns: Columns to be used as input for the model.
y_columns: Columns to be used as output for the model.
net_function: A function that defines a dm-haiku neural network and how to use it to predict (this function is passed to hk.transform). It should have the signature net_function(x: jnp.ndarray) -> jnp.ndarray.
loss: Loss function to use. See available loss functions in jax_toolkit.
optimizer: Optimizer to use. See available optimizers in jax.
batch_size: Batch size to use. If not specified, the number of rows in the entire dataframe is used.
apply_rng: If your net_function is non-deterministic, set this value to True to enable your model to predict with randomness.
rng_seed: Set a seed for reproducibility.
Returns
df containing a neural network model ready for training with pandas_toolkit.
Examples
>>> import haiku as hk
>>> import jax.numpy as jnp
>>> from jax.nn import relu
>>> def net_function(x: jnp.ndarray) -> jnp.ndarray:
...     net = hk.Sequential([relu])
...     predictions: jnp.ndarray = net(x)
...     return predictions
>>> df_train = df_train.nn.init(x_columns=["x"],
... y_columns=["y"],
... net_function=net_function,
... loss="mean_squared_error")
>>> for _ in range(10): # num_epochs
... df_train = df_train.nn.update(df_validation_to_plot=df_validation)
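The defaults can be overridden via the documented parameters; a hedged sketch (assuming optix.sgd is available alongside the optix.adam default shown in the signature, with illustrative hyperparameter values):
>>> from jax.experimental import optix
>>> df_train = df_train.nn.init(x_columns=["x"],
...                             y_columns=["y"],
...                             net_function=net_function,
...                             loss="mean_squared_error",
...                             optimizer=optix.sgd(learning_rate=1e-2),
...                             batch_size=32,
...                             rng_seed=42)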
get_model¶
get_model() -> Model [source]
Returns
A pandas_toolkit.nn.Model object. As this is not linked to a pd.DataFrame, it is much more lightweight and could be used in e.g. a production setting.
Examples
>>> model = df_train.nn.get_model()
>>> model.predict(x=jnp.array([42]))
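Because the Model is decoupled from the pd.DataFrame, it can be shipped to production on its own; a sketch assuming the object is picklable (not a documented guarantee):
>>> import pickle
>>> with open("model.pkl", "wb") as f:
...     pickle.dump(model, f)  # persist the lightweight model
>>> with open("model.pkl", "rb") as f:
...     model = pickle.load(f)
>>> model.predict(x=jnp.array([42]))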
hvplot_losses¶
hvplot_losses() -> None [source]
Returns
A Holoviews object for interactive (via Bokeh), real-time plotting of training and validation loss curves. For an example usage, see this notebook.
Examples
>>> df_train.nn.hvplot_losses()
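A typical workflow is to render the plot once and then train, assuming the curves update live as each epoch completes (per the real-time description above):
>>> df_train.nn.hvplot_losses()  # displays the live loss plot
>>> for _ in range(10):  # num_epochs; the plot refreshes as training progresses
...     df_train = df_train.nn.update(df_validation_to_plot=df_validation)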
update¶
update(df_validation_to_plot: pd.DataFrame = None) -> pd.DataFrame [source]
Parameters
df_validation_to_plot: Validation data to evaluate and update loss curve with.
Returns
df containing an updated neural network model (trained for one extra epoch).
Examples
>>> for _ in range(10): # num_epochs
... df_train = df_train.nn.update(df_validation_to_plot=df_validation)
predict¶
predict(x_columns: List[str] = None, batch_size: int = None) -> jnp.ndarray [source]
Parameters
x_columns: Columns to predict on. If None, the same x_columns names used to train the model are used.
batch_size: Batch size to use. If not specified, the entire dataset is used.
Returns
Predictions as a jnp.ndarray.
Examples
>>> df_new = pd.DataFrame({"x": [-10, -5, 22]})
>>> df_new.model = df_train.nn.get_model()
>>> df_new["predictions"] = df_new.nn.predict()
evaluate¶
evaluate(x_columns: List[str] = None, y_columns: List[str] = None, batch_size: int = None) -> jnp.ndarray [source]
Parameters
x_columns: Columns to predict on. If None, the same x_columns names used to train the model are used.
y_columns: Columns with true output values to compare predicted values with. If None, the same y_columns names used to train the model are used.
batch_size: Batch size to use. If not specified, the entire dataset is used.
Returns
Evaluation of the predictions using the loss provided in df.nn.init(...).
Examples
>>> df_test = pd.DataFrame({"x": [-1, 0, 1], "y": [0, 0, 1]})
>>> df_test.model = df_train.nn.get_model()
>>> df_test.nn.evaluate()
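For intuition, with loss="mean_squared_error" the returned value should match computing the loss by hand; a sketch assuming predictions come back with shape (n_rows, 1):
>>> y_pred = df_test.nn.predict()
>>> y_true = jnp.array(df_test[["y"]].values)
>>> jnp.mean((y_pred - y_true) ** 2)  # should equal df_test.nn.evaluate()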