
Fair Transformer GAN class

A class that defines the Fair Transformer GAN model. Processed data from the Dataset object is passed to the Model object, which trains the model and generates less-biased data. The Metrics object then uses this generated data to calculate model performance.

The train folder contains the script to train the model via the command line.
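For orientation, here is a minimal Python sketch of that workflow using only the methods documented below. The import path, file paths, and the p_z / p_y distributions are placeholders, not values shipped with the project.

```python
from fair_transformer_gan import FairTransformerGAN  # hypothetical import path

# Instantiate with the documented defaults for binary data with 58 feature columns.
model = FairTransformerGAN(dataType='binary', inputDim=58, embeddingDim=32)

# Train on processed data produced by the Dataset class, then generate
# less-biased samples that the Metrics object can score.
model.train(dataPath='data/processed.npy', modelPath='', outPath='out',
            pretrainEpochs=100, nEpochs=300, batchSize=1000,
            p_z=[0.5, 0.5], p_y=[0.5, 0.5])
model.generateData(nSamples=1000, modelFile='out/model', batchSize=100,
                   outFile='out/generated', p_z=[0.5, 0.5], p_y=[0.5, 0.5])
```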

| Function | Description |
| --- | --- |
| init | Initializes instance of FairTransformerGAN class |
| loadData | Load processed data created from Dataset class |
| buildAutoencoder | Build autoencoder that encodes the input data |
| buildGenerator | Build the generator for training |
| buildGeneratorTest | Build the generator when generating new data |
| getDiscriminatorResults | Calculate the discriminator predictions |
| buildDiscriminator | Build the discriminator |
| print2file | Print the training metrics to the log file |
| generateData | Generate new data using the trained model |
| calculateDiscAuc | Calculate discriminator AUC |
| calculateDiscAccuracy | Calculate discriminator accuracy |
| calculateGenAccuracy | Calculate generator accuracy |
| pair_rd | Calculate total pairwise risk difference |
| calculateRD | Calculate risk difference score across all protected attribute classes |
| calculateClassifierAccuracy | Calculate classifier accuracy |
| calculateClassifierRD | Calculate classifier risk difference score across all protected attribute classes |
| create_z_masks | Calculate mask for each protected attribute class |
| train | Train model |

FairTransformerGAN

CLASS

FairTransformerGAN()

init

Method

__init__(dataType='binary', inputDim=58, embeddingDim=32, randomDim=32, generatorDims=(32, 32), discriminatorDims=(32, 16, 1), compressDims=(), decompressDims=(), bnDecay=0.99, l2scale=0.001, lambda_fair=1)

Initializes FairTransformerGAN model with given parameters. Based on MedGAN architecture.

Parameters

  • dataType [str]: specifies if the input data contains only binary (0, 1) values or continuous values
  • inputDim [int]: number of columns in the input data, not including the protected attribute and outcome columns
  • embeddingDim [int]: dimension size of the embedding, which will be generated by the generator
  • randomDim [int]: dimension size of the random noise, on which the generator is conditioned
  • generatorDims [tuple]: dimensions of the generator. Note that another layer of size “embeddingDim” is always added.
  • discriminatorDims [tuple]: dimensions of the discriminator
  • compressDims [tuple]: dimensions of the encoder part of the autoencoder. Note that another layer of size “embeddingDim” is always added. Therefore this can be a blank tuple.
  • decompressDims [tuple]: dimensions of the decoder part of the autoencoder. Note that another layer, whose size is equal to the dimension of the input data, is always added. Therefore this can be a blank tuple.
  • bnDecay [float]: decay value for the moving average used in Batch Normalization
  • l2scale [float]: L2 regularization coefficient for all weights
  • lambda_fair [float]: coefficient of the fair regularization term

Return type

None
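As an illustration of the notes on generatorDims, compressDims, and decompressDims above, here is a plain-Python sketch (not project code) of how the effective layer sizes resolve from the constructor arguments.

```python
# An extra layer of size embeddingDim is appended to the generator and encoder
# dims, and an extra layer of size inputDim is appended to the decoder dims.
inputDim, embeddingDim = 58, 32
generatorDims, compressDims, decompressDims = (32, 32), (), ()

generator_layers = list(generatorDims) + [embeddingDim]    # [32, 32, 32]
encoder_layers = list(compressDims) + [embeddingDim]       # [32]
decoder_layers = list(decompressDims) + [inputDim]         # [58]
print(generator_layers, encoder_layers, decoder_layers)
```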

loadData

Method

loadData(dataPath='')

Loads data from given path and splits it into train and validation sets.

Parameters

  • dataPath [str]: absolute path to processed numpy data file

Return type

  • trainX, validX, trainz, validz, trainy, validy [np.ndarray]: arrays of the split train and validation data
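A minimal usage sketch, continuing the example above; the data path is a placeholder.

```python
trainX, validX, trainz, validz, trainy, validy = model.loadData(
    dataPath='/absolute/path/to/processed.npy')
print(trainX.shape, trainz.shape, trainy.shape)  # features, protected attribute, outcome
```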

buildAutoencoder

Method

buildAutoencoder(x_input)

Builds the autoencoder that encodes, compresses and then decompresses the input. Calculates the loss between the decompressed input and the original input.

Parameters

  • x_input [tf.Tensor]: tensor of x input data

Return type

  • loss [tf.Tensor]: float tensor loss between the decompressed input and the original x input
  • decodeVariables [dict]: variable that stores weights and biases of decompressed x input
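The following is a simplified, standalone TensorFlow 2 sketch of the kind of reconstruction loss described above. Using cross-entropy for binary data and squared error for continuous data is an assumption for illustration, not necessarily the class's exact formulation.

```python
import tensorflow as tf

def reconstruction_loss(x_input, x_reconstructed, dataType='binary'):
    """Simplified loss between the decompressed input and the original input."""
    if dataType == 'binary':
        eps = 1e-12  # avoid log(0)
        return -tf.reduce_mean(
            x_input * tf.math.log(x_reconstructed + eps)
            + (1.0 - x_input) * tf.math.log(1.0 - x_reconstructed + eps))
    # continuous inputs: mean squared error
    return tf.reduce_mean(tf.square(x_input - x_reconstructed))
```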

buildGenerator

Method

buildGenerator(x_input, y_input, z_input, bn_train)

Builds the generator. Generates the x data given the y outcome and z protected attribute during training. Applies multi-head self-attention to x data using MultiHeadSelfAttention class.

Parameters

  • x_input [tf.Tensor]: tensor of x input data
  • y_input [tf.Tensor]: tensor of y outcome data
  • z_input [tf.Tensor]: tensor of z protected attribute
  • bn_train [tf.Tensor]: boolean tensor specifying whether the generator is in the training phase

Return type

  • output [tf.Tensor]: generated x data
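Below is a standalone TF2 sketch of the conditioning idea (not the class's graph code): random noise is combined with the outcome y and protected attribute z and mapped to an embedding of size embeddingDim. The concatenation order, shapes, and layer choices are assumptions for illustration.

```python
import tensorflow as tf

batch, randomDim, embeddingDim = 16, 32, 32
noise = tf.random.normal([batch, randomDim])
y_input = tf.ones([batch, 1])    # outcome condition
z_input = tf.zeros([batch, 1])   # protected-attribute condition

# Condition the generator input on y and z, then project to the embedding size.
conditioned = tf.concat([noise, y_input, z_input], axis=1)
hidden = tf.keras.layers.Dense(32, activation='relu')(conditioned)
embedding = tf.keras.layers.Dense(embeddingDim, activation='tanh')(hidden)
```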

buildGeneratorTest

Method

buildGeneratorTest(x_input, y_input, z_input, bn_train)

Builds the generator used after model training. Generates the x data given the y outcome and z protected attribute once training is complete. Applies multi-head self-attention to the x data using the MultiHeadSelfAttention class.

Parameters

  • x_input [tf.Tensor]: tensor of x input data
  • y_input [tf.Tensor]: tensor of y outcome data
  • z_input [tf.Tensor]: tensor of z protected attribute
  • bn_train [tf.Tensor]: boolean tensor specifying whether the generator is in the training phase

Return type

  • output [tf.Tensor]: generated x data

getDiscriminatorResults

Method

getDiscriminatorResults(x_input, y_bool, keepRate, z_mask0, z_mask1, z_mask2, z_mask3, reuse=False)

Calculates the discriminator predictions.

Parameters

  • x_input [tf.Tensor]: tensor of x input data
  • y_bool [tf.Tensor]: boolean tensor representing the boolean value of outcome y
  • keepRate [float]: dropout keep rate of the discriminator
  • z_mask0 [tf.Tensor]: boolean tensor which is True where the z protected attribute is 0
  • z_mask1 [tf.Tensor]: boolean tensor which is True where the z protected attribute is 1
  • z_mask2 [tf.Tensor]: boolean tensor which is True where the z protected attribute is 2
  • z_mask3 [tf.Tensor]: boolean tensor which is True where the z protected attribute is 3
  • reuse [bool]: whether or not to reuse TensorFlow variables

Return type

  • f_hat [tf.Tensor]: probabilities of the generated x data being real/fake based on the protected attribute z
  • y_hat [tf.Tensor]: probabilities of the generated y classified by the classifier based on the generated x data
  • z_hat [tf.Tensor]: probabilities of the protected attribute z based on the generated y
  • g_hat [tf.Tensor]: probabilities of the generated x input

buildDiscriminator

Method

buildDiscriminator(x_real, y_real, x_fake, y_fake, yb_real, yb_fake, keepRate, decodeVariables, z_r_mask0, z_r_mask1, z_r_mask2, z_r_mask3, z_r_mask4, z_f_mask0, z_f_mask1, z_f_mask2, z_f_mask3, z_f_mask4)

Builds the discriminator.

Parameters

  • x_real [tf.Tensor]: tensor of x real input
  • y_real [tf.Tensor]: tensor of y real outcome
  • x_fake [tf.Tensor]: tensor of x fake input
  • y_fake [tf.Tensor]: tensor of y fake outcome
  • yb_real [tf.Tensor]: boolean tensor which is True where the real y outcome is 0
  • yb_fake [tf.Tensor]: boolean tensor which is True where the fake y outcome is 0
  • keepRate [float]: dropout keep rate of the discriminator
  • decodeVariables [dict]: variable that stores weights and biases of decompressed x input
  • z_r_mask0 [tf.Tensor]: boolean tensor which is True where the real z protected attribute is 0
  • z_r_mask1 [tf.Tensor]: boolean tensor which is True where the real z protected attribute is 1
  • z_r_mask2 [tf.Tensor]: boolean tensor which is True where the real z protected attribute is 2
  • z_r_mask3 [tf.Tensor]: boolean tensor which is True where the real z protected attribute is 3
  • z_r_mask4 [tf.Tensor]: boolean tensor which is True where the real z protected attribute is 4
  • z_f_mask0 [tf.Tensor]: boolean tensor which is True where the fake z protected attribute is 0
  • z_f_mask1 [tf.Tensor]: boolean tensor which is True where the fake z protected attribute is 1
  • z_f_mask2 [tf.Tensor]: boolean tensor which is True where the fake z protected attribute is 2
  • z_f_mask3 [tf.Tensor]: boolean tensor which is True where the fake z protected attribute is 3
  • z_f_mask4 [tf.Tensor]: boolean tensor which is True where the fake z protected attribute is 4

Return type

  • tensors [tf.Tensor]: decoded/predicted x, plus the losses and probabilities of the real and fake variables f, y, z, and g

print2file

Method

print2file(buf, outFile)

Writes training metrics to log file.

Parameters

  • buf [str]: data to write to file
  • outFile [str]: file path to model output

Return type

None

generateData

Method

generateData(nSamples=100, modelFile='model', batchSize=100, outFile='out', p_z=[], p_y=[])

Generates less-biased data using the trained model and saves it to the specified output path.

Parameters

  • nSamples [int]: size of entire original dataset
  • modelFile [str]: path to trained Fair Transformer GAN model
  • batchSize [int]: size of each batch
  • outFile [str]: path to generated data files in numpy format
  • p_z [list]: probability distribution of protected attribute
  • p_y [list]: probability distribution of outcome

Return type

None
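A short usage sketch; the paths and the p_z / p_y distributions below are placeholders chosen by the caller (one probability per protected attribute class and per outcome value, respectively).

```python
# Generate 10,000 samples with an assumed 60/40 protected-attribute split
# and a 70/30 outcome split, saving numpy files under the given output path.
model.generateData(nSamples=10000, modelFile='out/model', batchSize=100,
                   outFile='out/generated', p_z=[0.6, 0.4], p_y=[0.7, 0.3])
```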

calculateDiscAuc

Method

calculateDiscAuc(preds_real, preds_fake)

Calculates discriminator AUC from real and fake predictions.

Parameters

  • preds_real [numpy.ndarray]: array of real predictions
  • preds_fake [numpy.ndarray]: array of fake predictions

Return type

  • auc [float]: discriminator AUC
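Here is a minimal sketch of the computation described above, assuming real predictions are labelled 1 and fake predictions 0, with AUC taken over the pooled scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def disc_auc(preds_real, preds_fake):
    # Label real predictions as 1 and fake predictions as 0, then score.
    labels = np.concatenate([np.ones_like(preds_real), np.zeros_like(preds_fake)])
    scores = np.concatenate([preds_real, preds_fake])
    return roc_auc_score(labels, scores)
```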

calculateDiscAccuracy

Method

calculateDiscAccuracy(preds_real, preds_fake)

Calculates discriminator accuracy from real and fake predictions.

Parameters

  • preds_real [numpy.ndarray]: array of real predictions
  • preds_fake [numpy.ndarray]: array of fake predictions

Return type

  • acc [float]: discriminator accuracy

calculateGenAccuracy

Method

calculateGenAccuracy(preds_real, preds_fake)

Calculates generator accuracy from real and fake predictions.

Parameters

  • preds_real [numpy.ndarray]: array of real predictions
  • preds_fake [numpy.ndarray]: array of fake predictions

Return type

  • acc [float]: generator accuracy

pair_rd

Method

pair_rd(y_real, z_real)

Helper function to calculate total pairwise risk difference across all z protected attribute classes.

Parameters

  • y_real [numpy.ndarray]: array of y outcome values
  • z_real [numpy.ndarray]: array of z protected attribute values

Return type

  • risk_diff [float]: total risk difference score across all z protected attribute classes.
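The sketch below illustrates one common definition of total pairwise risk difference, assuming it is the sum of |P(y=1 | z=a) − P(y=1 | z=b)| over all pairs of protected attribute classes; the exact aggregation used by the class may differ.

```python
import itertools
import numpy as np

def pairwise_risk_difference(y, z):
    # Positive-outcome rate per protected-attribute class, then sum of
    # absolute differences across all class pairs.
    classes = np.unique(z)
    pos_rate = {c: y[z == c].mean() for c in classes}
    return sum(abs(pos_rate[a] - pos_rate[b])
               for a, b in itertools.combinations(classes, 2))

# Example: class 0 has positive rate 0.8, class 1 has 0.5 -> risk difference ~0.3
y = np.array([1, 1, 1, 1, 0, 1, 1, 0, 0])
z = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
print(pairwise_risk_difference(y, z))
```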

calculateRD

Method

calculateRD(y_real, z_real)

Calculates risk difference score across all z protected attribute classes during training. Calls the pair_rd() function.

Parameters

  • y_real [numpy.ndarray]: array of original y outcome values
  • z_real [numpy.ndarray]: array of original z protected attribute values

Return type

  • risk_diff [float]: total risk difference score across all z protected attribute classes

calculateClassifierAccuracy

Method

calculateClassifierAccuracy(preds_real, y_real)

Calculates classifier accuracy between real y outcome and predicted y.

Parameters

  • preds_real [numpy.ndarray]: array of predicted y based on x data generated from real x data
  • y_real [numpy.ndarray]: array of original y outcome values

Return type

  • acc [float]: classifier accuracy

calculateClassifierRD

Method

calculateClassifierRD(preds_real, z_real, y_real)

Calculates classifier risk difference score across all z protected attribute classes during training.

Parameters

  • preds_real [numpy.ndarray]: array of predicted y based on x data generated from real x data
  • z_real [numpy.ndarray]: array of original z protected attribute values
  • y_real [numpy.ndarray]: array of original y outcome values

Return type

  • rd [float]: total risk difference score across all z protected attribute classes
  • rd1 [float]: risk difference score across all z protected attribute classes when y outcome = 1
  • rd0 [float]: risk difference score across all z protected attribute classes when y outcome = 0

create_z_masks

Method

create_z_masks(z_arr)

Creates a z_mask for each protected attribute class (up to 5) in the z array. Each boolean mask is True at the indices where that class appears in the z array.

Parameters

  • z_arr [numpy.ndarray]: array of z protected attribute values

Return type

  • z_mask0 [numpy.ndarray]: array of z_mask for protected attribute class 0
  • z_mask1 [numpy.ndarray]: array of z_mask for protected attribute class 1
  • z_mask2 [numpy.ndarray]: array of z_mask for protected attribute class 2
  • z_mask3 [numpy.ndarray]: array of z_mask for protected attribute class 3
  • z_mask4 [numpy.ndarray]: array of z_mask for protected attribute class 4
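A minimal sketch of the masking described above: one boolean mask per protected attribute class value, up to 5 classes.

```python
import numpy as np

z_arr = np.array([0, 1, 2, 1, 0])
# One mask per class value 0..4; each mask is True where that class appears.
z_mask0, z_mask1, z_mask2, z_mask3, z_mask4 = [(z_arr == k) for k in range(5)]
print(z_mask1)  # [False  True False  True False]
```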

train

Method

train(dataPath='data', modelPath='', outPath='out', pretrainEpochs=100, nEpochs=300, generatorTrainPeriod=1, discriminatorTrainPeriod=2, pretrainBatchSize=100, batchSize=1000, saveMaxKeep=0, p_z=[], p_y=[])

Train the Fair Transformer GAN model and save it to the output path specified.

Parameters

  • dataPath [str]: path to input dataset
  • modelPath [str]: path to existing model, if it exists
  • outPath [str]: path to store model output and logs
  • nEpochs [int]: number of epochs to train the model
  • discriminatorTrainPeriod [int]: number of periods to train the discriminator per batch per epoch
  • generatorTrainPeriod [int]: number of periods to train the generator per batch per epoch
  • pretrainBatchSize [int]: size of pretraining batch
  • batchSize [int]: size of training batch
  • pretrainEpochs [int]: number of epochs to pretrain the model
  • saveMaxKeep [int]: number of checkpoint files to save
  • p_z [list]: probability distribution of protected attribute
  • p_y [list]: probability distribution of outcome

Return type

None
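The schematic, self-contained sketch below shows how the two train-period parameters are commonly interpreted in GAN training loops: per batch, the discriminator takes discriminatorTrainPeriod update steps and the generator takes generatorTrainPeriod update steps. The update helpers are stand-ins, not project code.

```python
nEpochs, discriminatorTrainPeriod, generatorTrainPeriod = 2, 2, 1
batches = range(3)  # placeholder batch iterator

def discriminator_update(batch):  # stand-in for a real discriminator optimizer step
    pass

def generator_update(batch):      # stand-in for a real generator optimizer step
    pass

for epoch in range(nEpochs):
    for batch in batches:
        for _ in range(discriminatorTrainPeriod):
            discriminator_update(batch)
        for _ in range(generatorTrainPeriod):
            generator_update(batch)
```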

MultiHeadSelfAttention class

Transformer multi-head self-attention block used in the generator of FairTransformerGAN. Used by buildGenerator and buildGeneratorTest of the FairTransformerGAN class.

| Function | Description |
| --- | --- |
| init | Initializes instance of MultiHeadSelfAttention class |
| call | Calculates the attention of the input data |

MultiHeadSelfAttention

CLASS

MultiHeadSelfAttention()

init

Method

__init__(num_heads, input_dim, dropout_rate=0.0)

Initializes multi-head self-attention block.

Parameters

  • num_heads [int]: number of heads of self-attention
  • input_dim [int]: size of input layer
  • dropout_rate [float]: proportion of neurons to drop in a layer

Return type

None

call

Method

call(inputs, mask=None, training=None)

Calculates the multi-head self-attention of the inputs.

Parameters

  • inputs [tf.Tensor]: input layer
  • mask [tf.Tensor]: mask layer
  • training [bool]: whether dropout should be applied during the attention calculation

Return type

  • attention_output [tf.Tensor]: self-attention output resized back to the input dimension
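A short usage sketch, assuming MultiHeadSelfAttention behaves like a standard Keras layer and that inputs are shaped (batch, sequence_length, input_dim); both assumptions are for illustration only.

```python
import tensorflow as tf

# Hypothetical shapes: batch of 8, sequence length 10, input_dim 32.
attn = MultiHeadSelfAttention(num_heads=4, input_dim=32, dropout_rate=0.1)
x = tf.random.normal([8, 10, 32])
attention_output = attn(x, training=True)  # trailing dimension matches input_dim
```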