Dataset class
A helper class for data pre-processing. Automates steps such as binarizing and scaling data, placing target and protected class in the right position, checks for nulls and allowed column types.
This is an optional step, if passing pre-processed data into the Model class make sure to place the protected variable column first and target variable column last.
Function | Description |
---|---|
init | Intializes instance of Dataset class |
pre_process | This step is optional and does a couple simple pre-processing check and adjustments, such as scaling, checking for nulls, setting protected and target columns |
post_process | Run on generated dataset to inverse scaling |
get_protected_distribution | Gets protected variable distribution from the dataset |
get_target_distribution | Gets target variable distribution from the dataset |
Dataset
CLASS
Dataset
()
init
Method
__init__(dataframe)
Initializes dataset class given a Pandas Dataframe.
Parameters
- dataframe [Pandas DataFrame]: Training data before pre-processing
- scaler [sklearn.preprocessing.MinMaxScaler]: initialize the MinMaxScaler object that will be used to scale data in pre-processing and post-processing
- np_data [numpy.ndarray]: Processed data in a Numpy array
- target_names List[str]: Target variable column names
- protected_names List[str]: Protected variable column names
- processed_col_types List[str]: store list for the original dataset column types, used in post-processing
Return type
None
pre_process
Method
pre_process(protected_var, outcome_var, output_file_name_path, multiclass=False, min_max_scale=True)
Basic pre-processing on a Pandas DataFrame including one-hot encoding, scaling, checking for nulls, etc. Saves a pickle file with a numpy array and a csv file with a data dictionary in the specified path.
Parameters
- protected_var [str] - Name of the protected column in the Pandas DataFrame
- outcome_var [str] - Name of the outcome column in the Pandas DataFrame
- output_file_name_path [str] - Name the Pickle file that will be saved in the data/interim folder
- multiclass (Optional [bool]) - Set to True if your protected variable is categorical has more than two states.
- min_max_scale (Optional [bool]) - Set to False if using scaled data
Raises
Exception - if dataset has nulls
Return type
- np_data [numpy.ndarray]: numpy array with pre-processed data
post_process
Method
post_process(gen_data_np)
Inverse scaling on the generated data from the trained model.
Parameters
- gen_data_np [np.ndarray] - numpy array with generated data
Return type
- gen_data_np [numpy.ndarray]: numpy array with post-processed data
get_protected_distribution
Method
get_protected_distribution(np_data)
Calculates the protected variable distribution after pre-processing.
Parameters
- np_data [np.ndarray] - data in a numpy array
Return type
- protected_distribution [List[float]]: distrbution of each protected class
To get the dataframe columns corresponding to class names, run protected_names
on the Dataset object. For example
dataset = Dataset()
np_data = dataset.pre_process(df, 'gender', 'income', 'out_file')
dataset.get_protected_distribution(np_data)
>>> [85.6, 14.4]
dataset.protected_names
>>> ['Male', 'Female']
get_target_distribution
Method
get_target_distribution(np_data)
Returns the target variable distribution after pre-processing.
Parameters
- np_data [np.ndarray] - data in a numpy array
Return type
- target_distribution [List[float]]: distrbution of each target class
To get the dataframe columns corresponding to class names, return
target_names
on the Dataset object.