
Dataset class

A helper class for data pre-processing. It automates steps such as binarizing and scaling data, placing the target and protected columns in the right position, and checking for nulls and allowed column types.

Using this class is optional. If you pass pre-processed data directly into the Model class, make sure the protected variable column comes first and the target variable column comes last.
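If you skip the Dataset class, the column ordering described above can be arranged by hand. The following is a minimal sketch using pandas; the helper name order_columns and the example column names are hypothetical and are not part of the library.

```python
import pandas as pd


def order_columns(df: pd.DataFrame, protected_col: str, target_col: str) -> pd.DataFrame:
    """Return a frame with the protected column first and the target column last."""
    # Everything that is neither protected nor target keeps its relative order.
    middle = [c for c in df.columns if c not in (protected_col, target_col)]
    return df[[protected_col] + middle + [target_col]]


# Hypothetical, already pre-processed data (column names are illustrative only).
df = pd.DataFrame({
    "age": [0.2, 0.5, 0.9],
    "income": [0, 1, 1],
    "gender": [1, 0, 1],
})

ordered = order_columns(df, protected_col="gender", target_col="income")
np_data = ordered.to_numpy(dtype=float)  # array layout: protected, features, target
```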

Function - Description
init - Initializes an instance of the Dataset class
pre_process - Optional step that runs a few simple pre-processing checks and adjustments, such as scaling, checking for nulls, and setting the protected and target columns
post_process - Run on the generated dataset to invert the scaling
get_protected_distribution - Gets the protected variable distribution from the dataset
get_target_distribution - Gets the target variable distribution from the dataset

Dataset

CLASS

Dataset()

init

Method

__init__(dataframe)

Initializes the Dataset class from a pandas DataFrame.

Parameters

  • dataframe [pandas.DataFrame]: Training data before pre-processing
  • scaler [sklearn.preprocessing.MinMaxScaler]: MinMaxScaler object used to scale data in pre-processing and to invert the scaling in post-processing
  • np_data [numpy.ndarray]: Processed data in a NumPy array
  • target_names [List[str]]: Target variable column names
  • protected_names [List[str]]: Protected variable column names
  • processed_col_types [List[str]]: Column types of the original dataset, used in post-processing

Return type

None

pre_process

Method

pre_process(protected_var, outcome_var, output_file_name_path, multiclass=False, min_max_scale=True)

Basic pre-processing on a pandas DataFrame, including one-hot encoding, scaling, and checking for nulls. Saves a pickle file with a NumPy array and a CSV file with a data dictionary in the specified path.

Parameters

  • protected_var [str] - Name of the protected column in the Pandas DataFrame
  • outcome_var [str] - Name of the outcome column in the Pandas DataFrame
  • output_file_name_path [str] - Name of the pickle file that will be saved in the data/interim folder
  • multiclass (Optional [bool]) - Set to True if your protected variable is categorical and has more than two states
  • min_max_scale (Optional [bool]) - Set to False if the data is already scaled

Raises

Exception - if the dataset contains nulls

Return type

  • np_data [numpy.ndarray]: numpy array with pre-processed data
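To make the steps named above concrete, here is a rough sketch of a null check, one-hot encoding, column ordering, and min-max scaling. It is not the library's implementation: the function name sketch_pre_process is hypothetical, the protected and outcome columns are assumed to be already binarized to 0/1, and the pickle and CSV output is left out.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def sketch_pre_process(df, protected_var, outcome_var, min_max_scale=True):
    """Hypothetical stand-in for pre_process; returns (array, fitted scaler)."""
    # 1. Refuse data with missing values, mirroring the documented Exception.
    if df.isnull().values.any():
        raise Exception("Dataset contains null values")

    # 2. One-hot encode the non-numeric feature columns only.
    middle = [c for c in df.columns if c not in (protected_var, outcome_var)]
    features = pd.get_dummies(df[middle], dtype=float)

    # 3. Protected column first, target column last.
    ordered = pd.concat([df[[protected_var]], features, df[[outcome_var]]], axis=1)
    data = ordered.to_numpy(dtype=float)

    # 4. Scale every column to [0, 1] unless the caller opts out.
    scaler = MinMaxScaler()
    if min_max_scale:
        data = scaler.fit_transform(data)
    return data, scaler


# Illustrative data only; assumes 'gender' and 'income' are already 0/1.
df = pd.DataFrame({
    "gender": [1, 0, 1, 0],
    "age": [20, 30, 40, 60],
    "job": ["a", "b", "a", "b"],
    "income": [0, 1, 1, 0],
})
data, scaler = sketch_pre_process(df, "gender", "income")
```

Keeping the fitted scaler around matters because the same object is needed later to invert the scaling on generated data.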

post_process

Method

post_process(gen_data_np)

Applies inverse scaling to the data generated by the trained model, returning it to the original value ranges.

Parameters

  • gen_data_np [np.ndarray] - numpy array with generated data

Return type

  • gen_data_np [numpy.ndarray]: numpy array with post-processed data
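The idea behind inverse scaling can be illustrated with scikit-learn's MinMaxScaler on its own. This is a sketch under the assumption that post_process relies on the scaler fitted during pre-processing; the arrays below are made-up stand-ins for real training and generated data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training data: two numeric columns with different ranges.
original = np.array([
    [20.0, 1000.0],
    [40.0, 3000.0],
    [60.0, 5000.0],
])

# Fit the scaler on the training data, as pre-processing would.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(original)

# Stand-in for model output, which lives in the scaled [0, 1] space.
generated = np.array([
    [0.5, 0.25],
    [1.0, 0.0],
])

# Map the generated values back to the original units.
restored = scaler.inverse_transform(generated)
# restored is [[40., 2000.], [60., 1000.]]
```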

get_protected_distribution

Method

get_protected_distribution(np_data)

Calculates the protected variable distribution after pre-processing.

Parameters

  • np_data [np.ndarray] - data in a numpy array

Return type

  • protected_distribution [List[float]]: distribution of each protected class

To get the dataframe columns corresponding to the class names, access the protected_names attribute on the Dataset object. For example:

dataset = Dataset(df)
np_data = dataset.pre_process('gender', 'income', 'out_file')

dataset.get_protected_distribution(np_data)

>>> [85.6, 14.4]

dataset.protected_names

>>> ['Male', 'Female']

get_target_distribution

Method

get_target_distribution(np_data)

Returns the target variable distribution after pre-processing.

Parameters

  • np_data [np.ndarray] - data in a numpy array

Return type

  • target_distribution [List[float]]: distribution of each target class

To get the dataframe columns corresponding to the class names, access the target_names attribute on the Dataset object.
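As a rough illustration of what the two distribution methods report, the sketch below computes class percentages for a single column of a pre-processed array. It assumes a binary protected variable in the first column, a binary target in the last column, and percentages as the unit; the helper class_distribution is hypothetical, and the library may compute this differently, particularly for multiclass protected variables.

```python
import numpy as np


def class_distribution(column):
    """Percentage of rows in each class of one column, ordered by class value."""
    _, counts = np.unique(column, return_counts=True)
    percentages = 100.0 * counts / counts.sum()
    return [round(float(p), 1) for p in percentages]


# Made-up pre-processed data: protected column first, target column last.
np_data = np.array([
    [1.0, 0.3, 0.0],
    [0.0, 0.7, 1.0],
    [1.0, 0.1, 1.0],
    [1.0, 0.9, 0.0],
])

protected_distribution = class_distribution(np_data[:, 0])   # [25.0, 75.0]
target_distribution = class_distribution(np_data[:, -1])     # [50.0, 50.0]
```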