Fair Transformer GAN class
A class that defines the Fair Transformer GAN model. Processed data from the Dataset object is passed to the Model object, which trains the model and generates less-biased data. The Metrics object then uses this generated data to calculate model performance.
The train folder contains the script to train the model via the command line.
Function | Description |
---|---|
init | Initializes instance of FairTransformerGAN class |
loadData | Load processed data created from Dataset class |
buildAutoencoder | Build autoencoder that encodes the input data |
buildGenerator | Build the generator for training |
buildGeneratorTest | Build the generator when generating new data |
getDiscriminatorResults | Calculate the discriminator predictions |
buildDiscriminator | Build the discriminator |
print2file | Print the training metrics to the log file |
generateData | Generate new data using the trained model |
calculateDiscAuc | Calculate discriminator AUC |
calculateDiscAccuracy | Calculate discriminator accuracy |
calculateGenAccuracy | Calculate generator accuracy |
pair_rd | Calculate total pairwise risk difference |
calculateRD | Calculate risk difference score across all protected attribute classes |
calculateClassifierAccuracy | Calculate classifier accuracy |
calculateClassifierRD | Calculate classifier risk difference score across all protected attribute classes |
create_z_masks | Calculate mask for each protected attribute class |
train | Train model |
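A minimal end-to-end sketch of the typical workflow, using only the methods documented below; the import path, file names, and example distributions are assumptions:

```python
from fair_transformer_gan import FairTransformerGAN  # hypothetical import path

# Instantiate with the documented defaults (binary input data, 58 feature columns).
model = FairTransformerGAN(dataType='binary', inputDim=58)

# Train on the processed numpy file created by the Dataset class and write
# checkpoints and logs to the output path.
model.train(dataPath='data/processed.npy',   # hypothetical file name
            outPath='out/',
            pretrainEpochs=100,
            nEpochs=300,
            batchSize=1000,
            p_z=[0.5, 0.5],                  # example protected-attribute distribution
            p_y=[0.7, 0.3])                  # example outcome distribution

# Generate less-biased synthetic data from the trained checkpoint.
model.generateData(nSamples=10000,
                   modelFile='out/model',    # hypothetical checkpoint path
                   batchSize=100,
                   outFile='out/generated',
                   p_z=[0.5, 0.5],
                   p_y=[0.7, 0.3])
```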
FairTransformerGAN
CLASS
FairTransformerGAN()
init
Method
__init__(dataType='binary', inputDim=58, embeddingDim=32, randomDim=32, generatorDims=(32, 32), discriminatorDims=(32, 16, 1), compressDims=(), decompressDims=(), bnDecay=0.99, l2scale=0.001, lambda_fair=1)
Initializes FairTransformerGAN model with given parameters. Based on MedGAN architecture.
Parameters
- dataType [str]: specifies if the input data contains only binary (0, 1) values or continuous values
- inputDim [int]: number of columns in the input data, not including the protected attribute and outcome columns
- embeddingDim [int]: dimension size of the embedding, which will be generated by the generator
- randomDim [int]: dimension size of the random noise, on which the generator is conditioned
- generatorDims [tuple]: dimensions of the generator. Note that another layer of size “embeddingDim” is always added.
- discriminatorDims [tuple]: dimensions of the discriminator
- compressDims [tuple]: dimensions of the encoder part of the autoencoder. Note that another layer of size “embeddingDim” is always added. Therefore this can be a blank tuple.
- decompressDims [tuple]: dimensions of the decoder part of the autoencoder. Note that another layer, whose size is equal to the dimension of the input data, is always added. Therefore this can be a blank tuple.
- bnDecay [float]: decay value for the moving average used in Batch Normalization
- l2scale [float]: L2 regularization coefficient for all weights
- lambda_fair [float]: coefficient of the fair regularization term
Return type
None
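The dimension arguments interact in one non-obvious way: the encoder always receives a final layer of size embeddingDim and the decoder a final layer of size inputDim, which is why compressDims and decompressDims may be empty tuples. A minimal sketch of that relationship (an illustration only, not the library's code):

```python
# How the autoencoder layer sizes follow from the constructor arguments.
inputDim, embeddingDim = 58, 32
compressDims, decompressDims = (), ()

encoder_layers = list(compressDims) + [embeddingDim]  # a layer of size embeddingDim is always appended
decoder_layers = list(decompressDims) + [inputDim]    # a final layer of size inputDim is always appended
print(encoder_layers, decoder_layers)                 # [32] [58]
```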
loadData
Method
loadData(dataPath='')
Loads data from given path and splits it into train and validation sets.
Parameters
- dataPath [str]: absolute path to processed numpy data file
Return type
- trainX, validX, trainz, validz, trainy, validy [np.ndarray]: arrays of split train and validation data
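A sketch of loading the processed data and inspecting the splits; the return order follows the documentation above, and the path is hypothetical:

```python
model = FairTransformerGAN(dataType='binary', inputDim=58)  # as in the sketch above

trainX, validX, trainz, validz, trainy, validy = model.loadData(
    dataPath='/abs/path/to/processed.npy')  # hypothetical path

# Rows stay aligned across features (X), protected attribute (z), and outcome (y).
print(trainX.shape, trainz.shape, trainy.shape)
print(validX.shape, validz.shape, validy.shape)
```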
buildAutoencoder
Method
buildAutoencoder(x_input)
Builds the autoencoder that encodes, compresses and then decompresses the input. Calculates the loss between the decompressed input and the original input.
Parameters
- x_input [tf.Tensor]: tensor of x input data
Return type
- loss [tf.Tensor]: float tensor loss between the decompressed input and the original x input
- decodeVariables [dict]: variable that stores weights and biases of decompressed x input
buildGenerator
Method
buildGenerator(x_input, y_input, z_input, bn_train)
Builds the generator. Generates the x data given the y outcome and z protected attribute during training. Applies multi-head self-attention to the x data using the MultiHeadSelfAttention class.
Parameters
- x_input [tf.Tensor]: tensor of x input data
- y_input [tf.Tensor]: tensor of y outcome data
- z_input [tf.Tensor]: tensor of z protected attribute
- bn_train [tf.Tensor]: boolean tensor specifying whether we are in the training phase of generator
Return type
- output [tf.Tensor]: generated x data
buildGeneratorTest
Method
buildGeneratorTest(x_input, y_input, z_input, bn_train)
Builds the generator used after model training. Generates the x data given the y outcome and z protected attribute once training is complete. Applies multi-head self-attention to the x data using the MultiHeadSelfAttention class.
Parameters
- x_input [tf.Tensor]: tensor of x input data
- y_input [tf.Tensor]: tensor of y outcome data
- z_input [tf.Tensor]: tensor of z protected attribute
- bn_train [tf.Tensor]: boolean tensor specifying whether we are in the training phase of generator
Return type
- output [tf.Tensor]: generated x data
getDiscriminatorResults
Method
getDiscriminatorResults(x_input, y_bool, keepRate, z_mask0, z_mask1, z_mask2, z_mask3, reuse=False)
Calculates the discriminator predictions.
Parameters
- x_input [tf.Tensor]: tensor of x input data
- y_bool [tf.Tensor]: boolean tensor representing outcome y
- keepRate [float]: dropout rate of discriminator
- z_mask0 [tf.Tensor]: boolean tensor which is True where the z protected attribute is 0
- z_mask1 [tf.Tensor]: boolean tensor which is True where the z protected attribute is 1
- z_mask2 [tf.Tensor]: boolean tensor which is True where the z protected attribute is 2
- z_mask3 [tf.Tensor]: boolean tensor which is True where the z protected attribute is 3
- reuse [bool]: whether or not to reuse TensorFlow variables
Return type
- f_hat [tf.Tensor]: probabilities of generated x data being real/fake based on the protected attribute z
- y_hat [tf.Tensor]: probabilities of generated y classified by the classifier based on the generated x data
- z_hat [tf.Tensor]: probabilities of protected attribute z based on generated y
- g_hat [tf.Tensor]: probabilities of generated x input
buildDiscriminator
Method
buildDiscriminator(x_real, y_real, x_fake, y_fake, yb_real, yb_fake, keepRate, decodeVariables, z_r_mask0, z_r_mask1, z_r_mask2, z_r_mask3, z_r_mask4, z_f_mask0, z_f_mask1, z_f_mask2, z_f_mask3, z_f_mask4)
Builds the discriminator.
Parameters
- x_real [tf.Tensor]: tensor of x real input
- y_real [tf.Tensor]: tensor of y real outcome
- x_fake [tf.Tensor]: tensor of x fake input
- y_fake [tf.Tensor]: tensor of y fake outcome
- yb_real [tf.Tensor]: boolean tensor which is True where the real y outcome is 0
- yb_fake [tf.Tensor]: boolean tensor which is True where the fake y outcome is 0
- keepRate [float]: dropout rate of discriminator
- decodeVariables [dict]: variable that stores weights and biases of decompressed x input
- z_r_mask0 [tf.Tensor]: boolean tensor which is True where the real z protected attribute is 0
- z_r_mask1 [tf.Tensor]: boolean tensor which is True where the real z protected attribute is 1
- z_r_mask2 [tf.Tensor]: boolean tensor which is True where the real z protected attribute is 2
- z_r_mask3 [tf.Tensor]: boolean tensor which is True where the real z protected attribute is 3
- z_r_mask4 [tf.Tensor]: boolean tensor which is True where the real z protected attribute is 4
- z_f_mask0 [tf.Tensor]: boolean tensor which is True where the fake z protected attribute is 0
- z_f_mask1 [tf.Tensor]: boolean tensor which is True where the fake z protected attribute is 1
- z_f_mask2 [tf.Tensor]: boolean tensor which is True where the fake z protected attribute is 2
- z_f_mask3 [tf.Tensor]: boolean tensor which is True where the fake z protected attribute is 3
- z_f_mask4 [tf.Tensor]: boolean tensor which is True where the fake z protected attribute is 4
Return type
- tensors [tf.Tensors]: decoded/predicted x, losses and probabilities of real and fake variables f, y, z, and g
print2file
Method
print2file(buf, outFile)
Writes training metrics to log file.
Parameters
- buf [str]: data to write to file
- outFile [str]: file path to model output
Return type
None
generateData
Method
generateData(nSamples=100, modelFile='model', batchSize=100, outFile='out', p_z=[], p_y=[])
Generates less-biased data using the trained model and saves it to the specified output path.
Parameters
- nSamples [int]: size of the entire original dataset
- modelFile [str]: path to the trained Fair Transformer GAN model
- batchSize [int]: size of each batch
- outFile [str]: path to generated data files in numpy format
- p_z [list]: probability distribution of protected attribute
- p_y [list]: probability distribution of outcome
Return type
None
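Continuing the sketches above, one plausible way to build the p_z and p_y arguments is from the empirical class frequencies of the real data; the expected format (one probability per class) is an assumption:

```python
import numpy as np

# Empirical class frequencies of the protected attribute and the outcome.
p_z = np.bincount(trainz.astype(int).ravel()) / len(trainz)
p_y = np.bincount(trainy.astype(int).ravel()) / len(trainy)

model.generateData(nSamples=len(trainX) + len(validX),  # size of the original dataset
                   modelFile='out/model',               # hypothetical checkpoint path
                   batchSize=100,
                   outFile='out/generated',
                   p_z=list(p_z),
                   p_y=list(p_y))
```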
calculateDiscAuc
Method
calculateDiscAuc(preds_real, preds_fake)
Calculates discriminator AUC from real and fake predictions.
Parameters
- preds_real [numpy.ndarray]: array of real predictions
- preds_fake [numpy.ndarray]: array of fake predictions
Return type
- auc [float]: discriminator AUC
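Since the method's internals are not shown here, the following is only a sketch of the standard way a discriminator AUC is computed from real and fake predictions (real samples labeled 1, fake labeled 0), consistent with the description above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def disc_auc(preds_real, preds_fake):
    # Score the discriminator's probabilities against labels 1 (real) and 0 (fake).
    preds = np.concatenate([preds_real, preds_fake])
    labels = np.concatenate([np.ones_like(preds_real), np.zeros_like(preds_fake)])
    return roc_auc_score(labels, preds)
```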
calculateDiscAccuracy
Method
calculateDiscAccuracy(preds_real, preds_fake)
Calculates discriminator accuracy from real and fake predictions.
Parameters
- preds_real [numpy.ndarray]: array of real predictions
- preds_fake [numpy.ndarray]: array of fake predictions
Return type
- acc [float]: discriminator accuracy
calculateGenAccuracy
Method
calculateGenAccuracy(preds_real, preds_fake)
Calculates generator accuracy from real and fake predictions.
Parameters
- preds_real [numpy.ndarray]: array of real predictions
- preds_fake [numpy.ndarray]: array of fake predictions
Return type
- acc [float]: generator accuracy
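A sketch of what the two accuracy helpers conventionally compute; the 0.5 threshold and the exact definitions are assumptions, since the method bodies are not shown here:

```python
import numpy as np

def disc_accuracy(preds_real, preds_fake, threshold=0.5):
    # The discriminator is correct when it scores real samples above the
    # threshold and fake samples at or below it.
    hits = np.sum(preds_real > threshold) + np.sum(preds_fake <= threshold)
    return hits / (len(preds_real) + len(preds_fake))

def gen_accuracy(preds_real, preds_fake, threshold=0.5):
    # The generator "succeeds" when its fake samples are scored as real;
    # preds_real is kept only to mirror the documented signature.
    return np.mean(preds_fake > threshold)
```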
pair_rd
Method
pair_rd(y_real, z_real)
Helper function to calculate total pairwise risk difference across all z protected attribute classes.
Parameters
- y_real [numpy.ndarray]: array of y outcome values
- z_real [numpy.ndarray]: array of z protected attribute values
Return type
- risk_diff [float]: total risk difference score across all z protected attribute classes.
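The risk difference idea is the gap in positive-outcome rates between protected-attribute classes, summed over every pair of classes. A minimal numpy sketch of that computation, assuming 1-D arrays (an illustration, not necessarily the method's exact implementation):

```python
import numpy as np
from itertools import combinations

def pairwise_risk_difference(y_real, z_real):
    classes = np.unique(z_real)
    # Positive-outcome rate P(y = 1 | z = c) for each protected-attribute class.
    rates = {c: np.mean(y_real[z_real == c]) for c in classes}
    # Sum the absolute differences over every pair of classes.
    return sum(abs(rates[a] - rates[b]) for a, b in combinations(classes, 2))
```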
calculateRD
Method
calculateRD(y_real, z_real)
Calculates risk difference score across all z protected attribute classes during training. Calls pair_rd() function.
Parameters
- y_real [numpy.ndarray]: array of original y outcome values
- z_real [numpy.ndarray]: array of original z protected attribute values
Return type
- risk_diff [float]: total risk difference score across all z protected attribute classes
calculateClassifierAccuracy
Method
calculateClassifierAccuracy(preds_real, y_real)
Calculates classifier accuracy between real y outcome and predicted y.
Parameters
- preds_real [numpy.ndarray]: array of predicted y based on x data generated from real x data
- y_real [numpy.ndarray]: array of original y outcome values
Return type
- acc [float]: classifier accuracy
calculateClassifierRD
Method
calculateClassifierRD(preds_real, z_real, y_real)
Calculates the classifier risk difference score across all z protected attribute classes during training.
Parameters
- preds_real [numpy.ndarray]: array of predicted y based on x data generated from real x data
- z_real [numpy.ndarray]: array of original z protected attribute values
- y_real [numpy.ndarray]: array of original y outcome values
Return type
- rd [float]: total risk difference score across all z protected attribute classes
- rd1 [float]: risk difference score across all z protected attribute classes when y outcome = 1
- rd0 [float]: risk difference score across all z protected attribute classes when y outcome = 0
create_z_masks
Method
create_z_masks(z_arr)
Creates a z_mask for each class (max 5) of the protected attribute in the z array. Each boolean mask is True at every index where that class occurs in the z array.
Parameters
- z_arr [numpy.ndarray]: array of z protected attribute values
Return type
- z_mask0 [numpy.ndarray]: array of z_mask for protected attribute class 0
- z_mask1 [numpy.ndarray]: array of z_mask for protected attribute class 1
- z_mask2 [numpy.ndarray]: array of z_mask for protected attribute class 2
- z_mask3 [numpy.ndarray]: array of z_mask for protected attribute class 3
- z_mask4 [numpy.ndarray]: array of z_mask for protected attribute class 4
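A sketch of the masking idea: each mask is a boolean array of the same length as z_arr that is True wherever that class value occurs (an illustration of the documented behaviour):

```python
import numpy as np

z_arr = np.array([0, 2, 1, 0, 3, 1])

# One boolean mask per protected-attribute class (up to 5 classes, 0 through 4).
z_masks = [(z_arr == k) for k in range(5)]
print(z_masks[0])  # [ True False False  True False False]
```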
train
Method
train(dataPath='data', modelPath='', outPath='out', pretrainEpochs=100, nEpochs=300, generatorTrainPeriod=1, discriminatorTrainPeriod=2, pretrainBatchSize=100, batchSize=1000, saveMaxKeep=0, p_z=[], p_y=[])
Trains the Fair Transformer GAN model and saves it to the specified output path.
Parameters
- dataPath [str]: path to input dataset
- modelPath [str]: path to existing model, if it exists
- outPath [str]: path to store model output and logs
- nEpochs [int]: number of epochs to train the model
- discriminatorTrainPeriod [int]: number of periods to train the discriminator per batch per epoch
- generatorTrainPeriod [int]: number of periods to train the generator per batch per epoch
- pretrainBatchSize [int]: size of pretraining batch
- batchSize [int]: size of training batch
- pretrainEpochs [int]: number of epochs to pretrain the model
- saveMaxKeep [int]: number of checkpoint files to save
- p_z [list]: probability distribution of protected attribute
- p_y [list]: probability distribution of outcome
Return type
None
MultiHeadSelfAttention class
Transformer multi-head self-attention block used in the generator of FairTransformerGAN. Used by the buildGenerator and buildGeneratorTest methods of the FairTransformerGAN class.
Function | Description |
---|---|
init | Initializes instance of MultiHeadSelfAttention class |
call | Calculates the attention of the inputted data |
MultiHeadSelfAttention
CLASS
MultiHeadSelfAttention()
init
Method
__init__(num_heads, input_dim, dropout_rate=0.0)
Initializes multi-head self-attention block.
Parameters
- num_heads [int]: number of heads of self-attention
- input_dim [int]: size of input layer
- dropout_rate [float]: proportion of neurons to drop in a layer
Return type
None
call
Method
call(inputs, mask=None, training=None)
Calculates the multi-head self-attention of the inputs.
Parameters
- inputs [tf.Tensor]: input layer
- mask [tf.Tensor]: mask layer
- training [bool]: whether we should add dropout during attention calculation
Return type
- attention_output [tf.Tensor]: self-attention block resized back to input dimension
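A sketch of how the block might be exercised on its own; the import path, the rank and shape of the input tensor, and the example sizes are assumptions, and only the constructor and call signatures above are documented:

```python
import tensorflow as tf
# from fair_transformer_gan import MultiHeadSelfAttention  # hypothetical import path

attn = MultiHeadSelfAttention(num_heads=4, input_dim=32, dropout_rate=0.1)

# A batch of 16 items, each treated as a length-1 sequence of 32-dim embeddings
# (the expected input rank is an assumption).
x = tf.random.normal((16, 1, 32))

out = attn.call(inputs=x, mask=None, training=True)
print(out.shape)  # the attention output is resized back to the input dimension
```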