Title: | Data Science Looks at Discrimination |
---|---|
Description: | Statistical and graphical tools for detecting and measuring discrimination and bias, be it racial, gender, age or other. Detection and remediation of bias in machine learning algorithms. 'Python' interfaces available. |
Authors: | Norm Matloff [aut, cre], Taha Abdullah [aut], Arjun Ashok [aut], Shubhada Martha [aut], Aditya Mittal [aut], Billy Ouattara [aut], Jonathan Tran [aut], Brandon Zarate Estrada [aut] |
Maintainer: | Norm Matloff <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2.2 |
Built: | 2024-11-26 05:56:57 UTC |
Source: | https://github.com/matloff/dsld |
A collection of criminal offenders screened in Florida (US) during 2013-14. This data was used to predict recidivism.
Additional details for this dataset can be found via the fairml package.
Wrappers for functions in the bnlearn package. (Presently, just iamb.)
dsldIamb(data)
data |
Data frame. |
Under very stringent assumptions, dsldIamb performs causal discovery, i.e. fits a causal model to data.
Object of class 'bn' (bnlearn object). The generic plot function is callable on this object.
N. Matloff
data(svcensus)
# iamb does not accept integer data
svcensus$wkswrkd <- as.numeric(svcensus$wkswrkd)
svcensus$wageinc <- as.numeric(svcensus$wageinc)
iambOut <- dsldIamb(svcensus)
plot(iambOut)
Confounder hunting: searches for variables C that predict both Y and S. Proxy hunting: searches for variables O that predict S.
dsldCHunting(data,yName,sName,intersectDepth=10)
dsldOHunting(data,yName,sName)
data |
Data frame. |
yName |
Name of the response variable column. |
sName |
Name of the sensitive attribute column. |
intersectDepth |
Maximum size of intersection of the Y predictor set and the S predictor set |
dsldCHunting: The random forests function qeML::qeRF will be run on the indicated data to assess feature importance in prediction of Y (without S) and S (without Y). Call these "important predictors" of Y and S.
Then for each i from 1 to intersectDepth, the intersection of the top i important predictors of Y and the top i important predictors of S will be reported, thus suggesting possible confounders. Larger values of i will report more potential confounders, though including progressively weaker ones.
The analyst may then consider omitting the variables C from models of the effect of S on Y.
Note: Run times may be long.
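The intersection step described for dsldCHunting can be sketched in base R. This is only an illustration: the importance scores below are invented numbers rather than output of qeML::qeRF, and the helper name topIntersections is hypothetical, not part of the package.

```r
# Hypothetical importance scores (larger = more important). In dsldCHunting
# these would come from random forests; here they are invented.
impY <- c(lsat = 0.9, decile3 = 0.7, ugpa = 0.5, fam_inc = 0.2, age = 0.1)
impS <- c(decile3 = 0.8, fam_inc = 0.6, lsat = 0.4, age = 0.3, ugpa = 0.1)

# For each i, intersect the top-i predictors of Y with the top-i predictors of S.
topIntersections <- function(impY, impS, intersectDepth) {
  rankY <- names(sort(impY, decreasing = TRUE))
  rankS <- names(sort(impS, decreasing = TRUE))
  depth <- min(intersectDepth, length(rankY), length(rankS))
  lapply(seq_len(depth), function(i) intersect(rankY[1:i], rankS[1:i]))
}

res <- topIntersections(impY, impS, intersectDepth = 3)
print(res)
```

With these made-up scores the intersection is empty at i = 1 and grows as i increases, which mirrors the statement that larger i reports more, but weaker, candidate confounders.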
dsldOHunting: Factors, if any, will be converted to dummy variables, and then the Kendall Tau correlations will be calculated between S and potential proxy variables O, i.e. every column other than Y and S. (The Y column itself doesn't enter into computation.)
In fairness analyses, in which one desires to either eliminate or reduce the impact of S, one must consider the indirect effect of S via O. One may wish to eliminate or reduce the role of O.
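As a rough illustration of the proxy-hunting computation, the following base R sketch builds dummy variables and computes Kendall correlations between each level of S and each candidate proxy. The data are simulated and the column names are invented; this is not the package's own code.

```r
# Toy data, invented for illustration: S is a three-level factor,
# loan and condo are candidate proxies O; Y plays no role here.
set.seed(1)
n <- 200
S <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
d <- data.frame(
  S = S,
  loan = rnorm(n) + 2 * (S == "a"),   # related to S level "a"
  condo = factor(sample(c("no", "yes"), n, replace = TRUE))  # unrelated to S
)

# One dummy column per level of S (no intercept, so no level is dropped).
sDummies <- model.matrix(~ S - 1, d)
# Candidate proxies: factors become dummies; the intercept column is dropped.
oMat <- model.matrix(~ loan + condo, d)[, -1, drop = FALSE]

# Kendall tau between each S dummy (rows) and each proxy column (columns).
corrs <- cor(sDummies, oMat, method = "kendall")
print(round(corrs, 2))
```

The result has one row for each level of S, matching the shape of the matrix described for dsldOHunting.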
The function dsldCHunting returns an R list, one component for each confounder set found.
The function dsldOHunting returns an R matrix of correlations, one row for each level of S.
N. Matloff
data(lsa)
dsldCHunting(lsa,'bar','race1')
# e.g. suggests confounders 'decile3', 'lsat'
data(mortgageSE)
dsldOHunting(mortgageSE,'deny','black')
# e.g. suggests using loan value and condo purchase as proxies
Plots (estimated) mean Y against X, separately for each level of S, with restrictions condits. May reveal Simpson's Paradox-like differences not seen in merely plotting mean Y against X.
dsldConditDisparity(data, yName, sName, xName, condits = NULL,
    qeFtn = qeKNN, minS = 50, useLoess = TRUE)
data |
Data frame or equivalent. |
yName |
Name of predicted variable Y. Must be numeric or dichotomous R factor. |
sName |
Name of the sensitive variable S, an R factor |
xName |
Name of a numeric column for the X-axis. |
condits |
An R vector; each component is a character
string for an R logical expression representing a desired
condition involving |
qeFtn |
|
minS |
Minimum size for an S group to be retained in the analysis. |
useLoess |
If TRUE, do loess smoothing on the fitted regression values. |
No value; plot.
N. Matloff, A. Ashok, S. Martha, A. Mittal
data(compas)
# graph probability of recidivism by race given age, among those with at
# most 4 prior convictions and COMPAS decile score at least 6
compas$two_year_recid <- as.numeric(compas$two_year_recid == "Yes")
dsldConditDisparity(compas,"two_year_recid", "race", "age",
    c("priors_count <= 4","decile_score>=6"), qeKNN)
dsldConditDisparity(compas,"two_year_recid", "race", "age",
    "priors_count == 0", qeGBoost)
Plots estimated densities of all continuous features X, conditioned on a specified categorical feature C.
dsldConfounders(data, sName, graphType = "plotly", fill = FALSE)
data |
Dataframe, at least 2 columns. |
sName |
Name of the categorical column, an R factor. In discrimination contexts, typically a sensitive variable. |
graphType |
Either "plot" or "plotly", for static or interactive graphs. The latter requires the plotly package. |
fill |
Only applicable to graphType = "plot" case. Setting to true will color each line down to the x-axis. |
No value; plot.
N. Matloff, T. Abdullah, A. Ashok, J. Tran
data(svcensus)
dsldConfounders(svcensus, "educ")
Graphs densities of a response variable, grouped by a sensitive variable.
Similar to dsldConfounders, but includes sliders to control the bandwidth of the density estimate (analogous to controlling the bin width in a histogram).
dsldDensityByS(data, cName, sName, graphType = "plotly", fill = FALSE)
data |
Dataset with at least 1 numerical column and 1 factor column |
cName |
Possible confounding variable column, an R numeric |
sName |
Name of the sensitive variable column, an R factor |
graphType |
Type of graph created. Defaults to "plotly". |
fill |
To fill the graph. Defaults to "FALSE". |
No value; plot.
N. Matloff, T. Abdullah, A. Ashok, J. Tran
data(svcensus)
dsldDensityByS(svcensus, cName = "wageinc", sName = "educ")
Explicitly Deweighted Features: control the effect of proxies related to sensitive variables for prediction.
dsldQeFairKNN(data, yName, sNames, deweightPars=NULL, yesYVal=NULL,k=25,
    scaleX=TRUE, holdout=floor(min(1000,0.1*nrow(data))))
dsldQeFairRF(data,yName,sNames,deweightPars=NULL, nTree=500, minNodeSize=10,
    mtry = floor(sqrt(ncol(data))),yesYVal=NULL,
    holdout=floor(min(1000,0.1*nrow(data))))
dsldQeFairRidgeLin(data, yName, sNames, deweightPars = NULL,
    holdout=floor(min(1000,0.1*nrow(data))))
dsldQeFairRidgeLog(data, yName, sNames, deweightPars = NULL,
    holdout = floor(min(1000, 0.1 * nrow(data))),
    yesYVal = levels(data[, yName])[2])
## S3 method for class 'dsldQeFair'
predict(object,newx,...)
data |
Dataframe, training set. |
yName |
Name of the response variable column. |
sNames |
Name(s) of the sensitive attribute column(s). |
deweightPars |
Values for de-emphasizing variables in a split, e.g. 'list(age=0.2,gender=0.5)'. In the linear case, larger values mean more deweighting, i.e. less influence of the given variable on predictions. For KNN and random forests, smaller values mean more deweighting. |
scaleX |
Scale the features. Defaults to TRUE. |
yesYVal |
Y value to be considered "yes," to be coded 1 rather than 0. |
k |
Number of nearest neighbors. In functions other than
|
holdout |
How many rows to use as the holdout/testing set. Can be NULL. The testing set is used to calculate the S correlation and test accuracy. |
nTree |
Number of trees. |
minNodeSize |
Minimum number of data points in a tree node. |
mtry |
Number of variables randomly tried at each split. |
object |
An object returned by the dsld-EDFFAIR wrapper. |
newx |
New data to be predicted. Must be in the same format as original data. |
... |
Further arguments. |
The sensitive variables S are removed entirely, but there is concern that they still affect prediction indirectly, via a set C of proxy variables.
Linear EDF reduces the impact of the proxies through a shrinkage process similar to that of ridge regression. Specifically, instead of minimizing the sum of squared errors SSE with respect to a coefficient vector b, we minimize SSE + the squared norm of Db, where D is a diagonal matrix with nonzero elements corresponding to C. Large values penalize the variables in C, thus shrinking their coefficients.
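The penalized criterion can be illustrated with a short base R sketch. Minimizing SSE + ||Db||^2 has the closed-form solution (X'X + D'D)^{-1} X'y, which is what the hypothetical helper edfRidge below computes. The data are simulated, and the penalty scale is arbitrary; it is not claimed to match the scale of deweightPars in the package.

```r
# Simulated data, for illustration only: x2 plays the role of a proxy in C.
set.seed(2)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
X <- cbind(intercept = 1, x1 = x1, x2 = x2)

# Minimize SSE + ||D b||^2, whose solution is (X'X + D'D)^{-1} X'y.
edfRidge <- function(X, y, d) {
  D <- diag(d, nrow = length(d))
  solve(crossprod(X) + crossprod(D), crossprod(X, y))[, 1]
}

bOLS <- edfRidge(X, y, d = c(0, 0, 0))    # no penalty: ordinary least squares
bEDF <- edfRidge(X, y, d = c(0, 0, 30))   # penalize only the proxy x2
print(rbind(bOLS, bEDF))
```

Only the diagonal entry for x2 is nonzero, so the coefficient of the proxy is pulled toward zero while the other variables are not directly penalized.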
KNN EDF reduces the weights in Euclidean distance for variables in C. The random forests version reduces the probabilities that a proxy will be used in splitting a node.
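For the KNN case, the idea of reduced weights in Euclidean distance can be sketched as follows. The function name weightedDist and the way the weights enter the distance are assumptions made for illustration; the package may apply the weights differently.

```r
# Weighted Euclidean distance: a weight below 1 shrinks a feature's contribution.
# Function name and weights are hypothetical, chosen for illustration.
weightedDist <- function(a, b, w) sqrt(sum((w * (a - b))^2))

a <- c(age = 0.0, decile_score = 0.0)
b <- c(age = 1.0, decile_score = 2.0)

dFull <- weightedDist(a, b, w = c(1, 1))          # ordinary Euclidean distance
dDeweighted <- weightedDist(a, b, w = c(1, 0.1))  # decile_score deweighted
print(c(full = dFull, deweighted = dDeweighted))
```

With the smaller weight, differences in decile_score matter much less when neighbors are chosen, consistent with smaller values meaning more deweighting for KNN.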
By using various values of the deweighting parameters, the user can choose a desired position in the Fairness-Utility Tradeoff.
More details can be found in the references.
The EDF functions return objects of class 'dsldQeFair', which include components for test and base accuracy, summaries of inputs and so on.
N. Matloff, A. Mittal, J. Tran
https://github.com/matloff/EDFfair
Matloff, Norman, and Wenxi Zhang. "A novel regularization approach to fair ML." arXiv preprint arXiv:2208.06557 (2022).
data(compas1)
data(svcensus)
# dsldQeFairKNN: deweight "decile score" column with "race" as
# the sensitive variable
knnOut <- dsldQeFairKNN(compas1, "two_year_recid", "race",
    list(decile_score=0.1), yesYVal = "Yes")
knnOut$testAcc
knnOut$corrs
predict(knnOut, compas1[1,-8])
# dsldQeFairRF: deweight "decile score" column with "race" as sensitive variable
rfOut <- dsldQeFairRF(compas1, "two_year_recid", "race",
    list(decile_score=0.3), yesYVal = "Yes")
rfOut$testAcc
rfOut$corrs
predict(rfOut, compas1[1,-8])
# dsldQeFairRidgeLin: deweight "occupation" and "age" columns
lin <- dsldQeFairRidgeLin(svcensus, "wageinc", "gender",
    deweightPars = list(occ=.4, age=.2))
lin$testAcc
lin$corrs
predict(lin, svcensus[1,-4])
# dsldQeFairRidgeLog: deweight "decile score" column
log <- dsldQeFairRidgeLog(compas1, "two_year_recid", "race",
    list(decile_score=0.1), yesYVal = "Yes")
log$testAcc
log$corrs
predict(log, compas1[1,-8])
Fair machine learning models: estimation and prediction. The following functions provide wrappers for some functions in the fairml package.
dsldFrrm(data, yName, sName, unfairness, definition = "sp-komiyama",
    lambda = 0, save.auxiliary = FALSE)
dsldFgrrm(data, yName, sName, unfairness, definition = "sp-komiyama",
    family = "binomial", lambda = 0, save.auxiliary = FALSE)
dsldNclm(data, yName, sName, unfairness, covfun = cov, lambda = 0,
    save.auxiliary = FALSE)
dsldZlm(data, yName, sName, unfairness)
dsldZlrm(data, yName, sName, unfairness)
data |
Data frame. |
yName |
Name of the response variable column. |
sName |
Name(s) of the sensitive attribute column(s). |
unfairness |
A number in (0, 1]. Degree of unfairness allowed in the model. A value (very near) 0 means the model is completely fair, while a value of 1 means the model is not constrained to be fair at all. |
covfun |
A function computing covariance matrices. |
definition |
Character string, the label of the definition of fairness. Currently either 'sp-komiyama', 'eo-komiyama' or 'if-berk'. |
family |
A character string, either 'gaussian' to fit linear regression, 'binomial' for logistic regression, 'poisson' for log-linear regression, 'cox' for Cox proportional hazards regression, or 'multinomial' for multinomial logistic regression. |
lambda |
Non-negative number, a ridge-regression penalty coefficient. |
save.auxiliary |
A logical value, whether to save the fitted values and the residuals of the auxiliary model that constructs the debiased predictors. |
See documentation for the fairml package.
An object of class 'dsldFairML', which includes the model information, yName, and sName.
S. Martha, A. Mittal, B. Ouattara, B. Zarate, J. Tran
data(svcensus)
data(compas1)
yName <- "wageinc"
sName <- "age"
frrmOut <- dsldFrrm(svcensus, yName, sName, 0.2, definition = "sp-komiyama")
summary(frrmOut)
predict(frrmOut, svcensus[1:10,])
yName <- "two_year_recid"
sName <- "age"
fgrrmOut <- dsldFgrrm(compas1, yName, sName, 0.2, definition = "sp-komiyama")
summary(fgrrmOut)
predict(fgrrmOut, compas1[c(1:10),])
Exploration of the Fairness-Utility Tradeoff. Finds predictive accuracy and correlation between S and predicted Y.
dsldFairUtilTrade(data,yName,sName,dsldFtnName,
    unfairness=NULL,deweightPars=NULL,yesYVal=NULL,yesSVal=NULL,
    corrType='kendall', holdout = floor(min(1000, 0.1 * nrow(data))))
data |
Data frame. |
yName |
Name of the response variable Y column. Y must be numeric or binary (two-level R factor). |
sName |
Name of the sensitive attribute S column. S must be numeric or binary (two-level R factor). |
dsldFtnName |
Quoted name of one of the fairml or EDF functions. |
unfairness |
Nonnull for the fairml functions. |
deweightPars |
Nonnull for the EDF functions. |
yesYVal |
Y value to be treated as Y = 1 for binary Y. |
yesSVal |
S value to be treated as S = 1 for binary S. |
corrType |
Either 'kendall' or 'probs'. |
holdout |
Size of holdout set. |
Tool for exploring tradeoff between utility (predictive accuracy, Mean Absolute Prediction Error or overall probability of misclassification) and fairness. Roughly speaking, the latter is defined as the strength of relation between S and predicted Y (the smaller, the better). The main issue is definition of "relation" in the case of binary Y or S:
In the 'kendall' case, binary predicted Y or S is recoded to 1s and 0s, and Kendall correlation is used. In the 'probs' case, binary Y or S is replaced by P(Y = 1 | X) and P(S = 1 | X); squared Pearson correlation is then computed.
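The two quantities in the 'kendall' case can be illustrated in base R, without calling dsld. The sketch below uses simulated data, an ordinary lm fit that omits S, and a fixed holdout split; all of these are assumptions for illustration, and the component names of result are invented.

```r
# Simulated data for illustration; none of this calls dsld itself.
set.seed(3)
n <- 1000
s <- rbinom(n, 1, 0.5)               # binary S, already coded 0/1
x <- rnorm(n) + s                    # a feature correlated with S
y <- 2 * x + rnorm(n)                # numeric Y

# Hold out the last 200 rows, fit on the rest without using S.
train <- data.frame(x = x, y = y)[1:800, ]
test <- data.frame(x = x, y = y)[801:1000, ]
fit <- lm(y ~ x, data = train)
pred <- predict(fit, test)

# Utility: Mean Absolute Prediction Error on the holdout set.
mape <- mean(abs(pred - test$y))
# Fairness: Kendall correlation between S and predicted Y (smaller is better).
sCorr <- cor(s[801:1000], pred, method = "kendall")
result <- c(testAcc = mape, corr = sCorr)
print(result)
```

Even though S is not a predictor, the correlation is clearly positive here because x acts as a proxy for S, which is the situation the tradeoff analysis is meant to expose.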
A two-component vector, consisting of predictive accuracy and strength of relation between S and predicted Y.
N. Matloff
data(svcensus)
dsldFairUtilTrade(svcensus,'wageinc','gender','dsldFrrm',0.2,yesSVal='male')
data(lsa)
race1 <- lsa$race1
lsabw <- lsa[race1 == 'black' | race1 == 'white',]
# need to get rid of excess levels
race1 <- lsabw$race1
race1 <- as.character(race1)
lsabw$race1 <- as.factor(race1)
dsldFairUtilTrade(lsabw,'bar','race1','dsldQeFairRidgeLog',
    deweightPars=list(fam_inc=0.1),yesYVal='TRUE',yesSVal='white')
Wrapper for the freqparcoord function from the freqparcoord package.
dsldFreqPCoord(data, m, sName = NULL, method = "maxdens",
    faceting = "vert", k = 50, klm = 5 * k, keepidxs = NULL,
    plotidxs = FALSE, cls = NULL, plot_filename = NULL)
data |
Data frame or matrix. |
m |
Number of lines to plot for each group. A negative value in conjunction
with the method |
sName |
Column for the grouping variable, if any (if none, all the data
is treated as a single group); the column must be a vector or factor.
The column must not be in |
method |
What to display: 'maxdens' for plotting the most (or least) typical lines, 'locmax' for cluster hunting, or 'randsamp' for plotting a random sample of lines. |
faceting |
How to display groups, if present. Use 'vert' for vertical stacking of group plots, 'horiz' for horizontal ones, or 'none' to draw all lines in one plot, color-coding by group. |
k |
Number of nearest neighbors to use for density estimation. |
klm |
If method is "locmax", number of nearest neighbors to
use for finding local maxima for cluster hunting. Generally needs
to be much larger than |
keepidxs |
If not NULL, the indices of the rows of |
plotidxs |
If TRUE, lines in the display will be annotated
with their case numbers, i.e. their row numbers within |
cls |
Cluster, if any (see the |
plot_filename |
Name of the file that will hold the saved graph image. If NULL, the graph will be generated and displayed without being saved. If a filename is provided, the graph will not be displayed, only saved. |
The dsldFreqPCoord function wraps freqparcoord, which uses a frequency-based parallel coordinates method to visualize multiple variables simultaneously in graph form.
This is done by plotting either the "most typical" or "least typical" (i.e. highest or lowest estimated multivariate density values respectively) cases to discern relations between variables.
The Y-axis represents the centered and scaled values of the columns.
Object of type 'gg' (ggplot2 object), with components idxs and xdisp added if keepidxs is not NULL (see argument keepidxs above).
N. Matloff, T. Abdullah, B. Ouattara, J. Tran, B. Zarate
https://cran.r-project.org/web/packages/freqparcoord/index.html
data(lsa)
lsa1 <- lsa[,c('fam_inc','ugpa','gender','lsat','race1')]
dsldFreqPCoord(lsa1,75,'race1')
# a number of interesting trends among the most "typical" law students in the
# dataset: remarkably little variation among typical
# African-Americans; typical Hispanic men have low GPAs, poor LSAT
# scores there is more variation; typical Asian and Black students were
# female; Asians and Hispanics have the most variation in family income
# background
Informal assessment of C as a possible confounder in a relationship between a sensitive variable S and a variable Y.
dsldFrequencyByS(data, cName, sName)
data |
Data frame or equivalent. |
cName |
Name of the "C" column, an R factor. |
sName |
Name of the sensitive variable column, an R factor |
Essentially an informal assessment of the relation between S and C.
Consider the svcensus dataset. If for instance we are studying the effect of gender S on wage income Y, say C is occupation. If different genders have different occupation patterns, then C is a potential confounder. (Y does not explicitly appear here.)
Data frame, with one row for each level of the sensitive variable S and one column for each level of the confounder C. Each row sums to 1.0.
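A table of this shape can be produced in base R with table and prop.table. The sketch below uses a tiny invented data set and is meant only to show the row-proportion layout, not the package's implementation.

```r
# Toy data, invented for illustration.
d <- data.frame(
  gender = factor(c("f", "f", "f", "f", "m", "m", "m", "m")),
  occ = factor(c("a", "a", "a", "b", "a", "b", "b", "b"))
)

# Counts of C within each level of S, converted to row proportions.
freqs <- prop.table(table(d$gender, d$occ), margin = 1)
freqDF <- as.data.frame.matrix(freqs)
print(freqDF)
```

Rows whose proportions differ markedly, as they do here, indicate that C is distributed differently across levels of S.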
N. Matloff, T. Abdullah, A. Ashok, J. Tran
data(svcensus)
dsldFrequencyByS(svcensus, cName = "educ", sName = "gender")
# not much difference in education between genders
dsldFrequencyByS(svcensus, cName = "occ", sName = "gender")
# substantial difference in occupation between genders
data(lsa)
lsa$faminc <- as.factor(lsa$fam_inc)
dsldFrequencyByS(lsa,'faminc','race1')
# distribution of family income by race
Comparison of sensitive groups via linear models, with or without interactions with the sensitive variable.
dsldLinear(data, yName, sName, interactions = FALSE,
    sComparisonPts = NULL, useSandwich = FALSE)
## S3 method for class 'dsldLM'
summary(object,...)
## S3 method for class 'dsldLM'
predict(object,xNew,...)
## S3 method for class 'dsldLM'
coef(object,...)
## S3 method for class 'dsldLM'
vcov(object,...)
data |
Data frame. |
yName |
Name of the response variable Y column. |
sName |
Name of the sensitive attribute S column. |
interactions |
Logical value indicating whether or not to model interactions with the sensitive variable S. |
sComparisonPts |
If |
useSandwich |
If TRUE, use the "sandwich" variance estimator. |
object |
An object returned by the |
xNew |
New data to be predicted. Must be in the same format as original data. |
... |
Further arguments. |
The dsldLinear function fits a linear model to the response variable Y using all other variables in data. The user may select for interactions with the sensitive variable S.
The function produces an instance of the 'dsldLM' class (an S3 object). Methods for the generic functions summary, predict, coef and vcov are provided.
If interactions is TRUE, the function will fit m separate models, where m is the number of levels of S. Then summary will contain m+1 data frames, the first m of which will be the outputs from the individual models. The (m+1)st data frame will compare the differences in conditional mean Y|X for each pair of S levels, and for each value of X in sComparisonPts.
The intention is to allow users to see the comparisons of conditions for sensitive groups via linear models, with interactions with S.
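The interactions case can be pictured with plain lm calls: one model per level of S, then differences in estimated conditional means at chosen X values. The sketch below uses simulated data and invented column names, and it omits the standard errors that a full comparison would report.

```r
# Simulated data for illustration: the slope on age differs by group.
set.seed(4)
n <- 400
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
age <- runif(n, 25, 60)
wage <- ifelse(gender == "male", 20 + 1.5 * age, 30 + 1.0 * age) +
  rnorm(n, sd = 5)
d <- data.frame(wage, age, gender)

# One model per level of S, fitted on that level's rows only.
fits <- lapply(split(d, d$gender), function(g) lm(wage ~ age, data = g))

# Compare estimated mean Y | X between the two S levels at chosen X values.
compPts <- data.frame(age = c(30, 50))
preds <- sapply(fits, predict, newdata = compPts)
diffs <- preds[, "male"] - preds[, "female"]
print(cbind(compPts, preds, maleMinusFemale = diffs))
```

Because the slopes differ, the estimated gap changes with age, which is why comparison points must be supplied when interactions are modeled.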
The dsldDiffS function allows users to compare mean Y between each pair of S levels at additional, previously unseen X values, using the model fitted by dsldLinear.
The dsldLinear function returns an S3 object of class 'dsldLM', with one component for each level of S. Each component includes information about the fitted model.
N. Matloff, A. Mittal, A. Ashok
data(svcensus)
newData <- svcensus[c(1, 18), -c(4,6)]
lin1 <- dsldLinear(svcensus, 'wageinc', 'gender', interactions = TRUE,
    newData)
coef(lin1)
vcov(lin1)
summary(lin1)
predict(lin1, newData)
lin2 <- dsldLinear(svcensus, 'wageinc', 'gender', interactions = FALSE)
summary(lin2)
Comparison of conditions for sensitive groups via logistic regression models, with or without interactions with the sensitive variable.
dsldLogit(data, yName, sName, sComparisonPts = NULL, interactions = FALSE,
    yesYVal)
## S3 method for class 'dsldGLM'
summary(object,...)
## S3 method for class 'dsldGLM'
predict(object,xNew,...)
## S3 method for class 'dsldGLM'
coef(object,...)
## S3 method for class 'dsldGLM'
vcov(object,...)
data |
Data frame used to train the linear model; will be split according to
each level of |
yName |
Name of the response variable column. |
sName |
Name of the sensitive attribute column. |
interactions |
If TRUE, fit interactions with the sensitive variable. |
sComparisonPts |
If |
yesYVal |
Y value to be considered 'yes', to be coded 1 rather than 0. |
object |
An object returned by |
xNew |
Dataframe to predict new cases. Must be in the same format
as |
... |
Further arguments. |
The dsldLogit function fits a logistic regression model to the response variable. Interactions are handled as in dsldLinear.
The dsldLogit function returns an S3 object of class 'dsldGLM', with one component for each level of S. Each component includes information about the fitted model.
N. Matloff, A. Mittal, A. Ashok
data(lsa)
newData <- lsa[c(2,22,222,2222),-c(8,11)]
log1 <- dsldLogit(lsa,'bar','race1', newData, interactions = TRUE, 'TRUE')
coef(log1)
vcov(log1)
summary(log1)
predict(log1, newData)
log2 <- dsldLogit(data = lsa, yName = 'bar',sName = 'gender',
    interactions = FALSE, yesYVal = 'TRUE')
summary(log2)
Causal inference via matching models.
Wrapper for Matching::Match.
dsldMatchedATE(data,yName,sName,yesSVal,yesYVal=NULL,
    propensFtn=NULL,k=NULL)
data |
Data frame. |
yName |
Name of the response variable column. |
sName |
Name of the sensitive attribute column. The attribute must be dichotomous. |
yesSVal |
S value to be considered "yes," to be coded 1 rather than 0. |
yesYVal |
Y value to be considered "yes," to be coded 1 rather than 0. |
propensFtn |
Either 'glm' (logistic), or 'knn'. |
k |
Number of nearest neighbors if |
This is a dsld wrapper for Matching::Match.
Matched analysis is typically applied to measuring "treatment effects," but is often applied in situations in which the "treatment," S here, is an immutable attribute such as race or gender. The usual issues concerning observational studies apply.
The function dsldMatchedATE finds the estimated mean difference between the matched Y pairs in the treated/nontreated (exposed and non-exposed) groups, with covariates X in data other than the yName and sName columns.
In the propensity model case, we estimate P(S = 1 | X), either by a logistic or k-NN model.
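For the logistic option, estimating P(S = 1 | X) amounts to a standard glm fit. The sketch below uses simulated data and invented variable names; it shows only the propensity-estimation step, not the matching itself or the package's internal code.

```r
# Simulated data for illustration: treatment S depends on covariate x.
set.seed(5)
n <- 500
x <- rnorm(n)
s <- rbinom(n, 1, plogis(0.8 * x))
d <- data.frame(s, x)

# Logistic propensity model: estimated P(S = 1 | X) for every row.
propFit <- glm(s ~ x, data = d, family = binomial)
propens <- predict(propFit, type = "response")
print(summary(propens))
```

Matching on these estimated probabilities, rather than on X directly, is the usual motivation for a propensity model.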
Object of class 'Match'. See documentation in the Matching package.
N. Matloff
data(lalonde,package='Matching')
ll <- lalonde
ll$treat <- as.factor(ll$treat)
ll$re74 <- NULL
ll$re75 <- NULL
summary(dsldMatchedATE(ll,'re78','treat','1'))
summary(dsldMatchedATE(ll,'re78','treat','1',propensFtn='glm'))
summary(dsldMatchedATE(ll,'re78','treat','1',propensFtn='knn',k=15))
Nonparametric comparison of sensitive groups.
dsldML(data,yName,sName,qeMLftnName,sComparisonPts="rand5",
    opts=NULL,holdout=NULL)
data |
A data frame. |
yName |
Name of the response variable column. |
sName |
Name(s) of the sensitive attribute column(s). |
qeMLftnName |
Quoted name of a prediction function in the |
sComparisonPts |
Data frame of one or more data points at which the regression function is to be estimated for each level of S. If this is 'rand5', then the said data points will consist of five randomly chosen rows in the original dataset. |
opts |
An R list specifying arguments for the above |
holdout |
The size of holdout set. |
In a linear model with no interactions, one can speak of "the" difference in mean Y given X across treatments, independent of X. In a nonparametric analysis, there is interaction by definition, and one can only speak of differences across treatments for a specific X value. Hence the need for the argument sComparisonPts.
The specified qeML function will be called on the indicated data once for each level of the sensitive variable. For each such level, estimated regression function values will be obtained for each row in sComparisonPts.
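The per-level fitting scheme can be sketched in base R with a hand-written nearest-neighbor estimator standing in for a qeML function. The helper knnMean, the simulated data and the single numeric feature are all assumptions made for illustration.

```r
# Simulated data for illustration.
set.seed(6)
n <- 600
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
age <- runif(n, 25, 60)
wage <- 40 + 10 * sin(age / 10) + 8 * (gender == "male") + rnorm(n, sd = 3)
d <- data.frame(wage, age, gender)

# Hypothetical helper: k-NN estimate of mean Y at each new X (one numeric feature).
knnMean <- function(xTrain, yTrain, xNew, k) {
  sapply(xNew, function(x0) mean(yTrain[order(abs(xTrain - x0))[1:k]]))
}

# Five randomly chosen rows serve as comparison points.
compPts <- d[sample(n, 5), ]
# Fit once per level of S, estimating the regression function at each point.
preds <- sapply(split(d, d$gender), function(g)
  knnMean(g$age, g$wage, compPts$age, k = 25))
print(cbind(age = round(compPts$age, 1), round(preds, 2)))
```

Each row of the result gives the estimated mean Y at one comparison point for every level of S, so group differences can be read off row by row.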
An R list. The first component consists of the holdout-set prediction accuracies, while the second is a data frame of predicted values for each sensitive group.
N. Matloff
data(svcensus)
w <- dsldML(svcensus,'wageinc','gender',qeMLftnName='qeKNN',
    opts=list(k=50))
print(w)
Plotly 3D visualization of a dataset on 3 axes, with points color-coded on a 4th variable.
dsldScatterPlot3D(data, yNames, sName, sGroups = NULL, sortedBy = "Name",
    numGroups = 8, maxPoints = NULL, xlim = NULL, ylim = NULL, zlim = NULL,
    main = NULL, colors = "Paired", opacity = 1, pointSize = 8)
data |
Data frame with at least 4 columns. |
yNames |
Vector of the indices or names of the columns of the data frame to be graphed on the 3 axes. |
sName |
Index or name of the column containing the groups by which the data will be grouped. This determines the colors of the points in the graph. The column must be an R factor. |
sGroups |
Vector of the names of the groups by which the data will be grouped.
Every value in the vector must exist in the |
sortedBy |
Controls how the groups are selected. "Name" gets the first values alphabetically. "Frequency" gets the most frequently occurring values. "Frequency-Descending" gets the least frequently occurring values. |
numGroups |
Number of groups to be automatically generated by the function. If
|
maxPoints |
Limit to how many points may be displayed on the graph. There is no limit by default. |
xlim, ylim, zlim |
The x, y and z limits, each a vector with c(min, max). |
main |
The title of the graph. By default, the |
colors |
Either a colorbrewer2.org palette name (e.g. "YlOrRd" or "Blues"), or a vector of colors to interpolate in hexadecimal "#RRGGBB" format, or a color interpolation function like colorRamp(). |
opacity |
A value between 0 and 1. |
pointSize |
A value above 1. |
An interactive Plotly visualization will be created, with the three
variables specified in yNames. Points will be color-coded
according to sName. The plot can be rotated etc. using the mouse.
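A plot of this kind can be sketched directly with the plotly package, as below. This is not the dsld code: the built-in `iris` data is used, the names `axes`, `group` and `max_points` are made up, and the random subsampling used to cap the number of points is an assumption about one reasonable way to honor a point limit. The plotting step is skipped when plotly is not installed.

```r
# Three numeric columns for the axes and one factor column for the colors.
d <- iris
axes <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
group <- "Species"

# Assumed behavior: cap the number of displayed points by random subsampling.
max_points <- 100
set.seed(1)
if (nrow(d) > max_points) {
  d <- d[sample(nrow(d), max_points), ]
}

# plotly is optional in this sketch.
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(
    d,
    x = d[[axes[1]]], y = d[[axes[2]]], z = d[[axes[3]]],
    color = d[[group]], colors = "Paired",
    type = "scatter3d", mode = "markers",
    marker = list(size = 4), opacity = 0.8
  )
  # In an interactive session, evaluating p displays the rotatable plot.
}
```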
No value; a plot is displayed.
J. Tran and B. Zarate
https://plotly.com/r/3d-scatter-plots/
data(lsa)
dsldScatterPlot3D(lsa,sName = "race1", yNames=c("ugpa", "lsat","age"), xlim=c(2,4))
Evaluate feature sets for predicting Y while considering the Fairness-Utility Tradeoff.
dsldTakeALookAround(data, yName, sName, maxFeatureSetSize = (ncol(data) - 2), holdout = floor(min(1000,0.1*nrow(data))))
data |
Data frame. |
yName |
Name of the response variable column. |
sName |
Name of the sensitive attribute column. |
maxFeatureSetSize |
Maximum size of the feature combinations to be evaluated and included in the data frame. |
holdout |
If not NULL, form a holdout set of the specified size. After fitting to the remaining data, evaluate accuracy on the holdout set. |
This function provides a tool for exploring feature combinations to use in predicting an outcome Y from features X and a sensitive variable S.
The features in X will first be considered singly, then doubly and so
on, up through feature combination size maxFeatureSetSize. Y is
predicted from X using either a linear model (numeric Y) or a logit
model (dichotomous Y).
The accuracy (based on qeML holdout) will be computed for each of these cases: (a) Y predicted from the given feature combination C, (b) Y predicted from the given feature combination C plus S, and (c) S predicted from C. The difference between columns 'a' and 'b' shows the sacrifice in utility stemming from not using S in our prediction of Y. (Due to sampling variation, it is possible for column 'b' to be larger than 'a'.) The value in column 'c' shows fairness, the smaller the fairer.
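The three quantities can be sketched in base R as follows. This is not the dsld implementation: the data are simulated, `lm` and `glm` replace the qeML calls, and the accuracy measures are assumptions (mean absolute holdout error for 'a' and 'b', proportion of correct holdout classifications for 'c'). The helpers `mae` and `s_acc` are hypothetical.

```r
set.seed(42)
n  <- 400
s  <- rbinom(n, 1, 0.5)            # sensitive variable
x1 <- rnorm(n, mean = s)           # correlated with s, so a potential proxy
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 * x1 + x2 + 3 * s + rnorm(n)
d  <- data.frame(y, s, x1, x2, x3)

# Holdout split.
hold <- sample(n, 100)
trn  <- d[-hold, ]
tst  <- d[hold, ]

# Holdout mean absolute error when predicting y from the given variables.
mae <- function(vars) {
  fit <- lm(reformulate(vars, response = "y"), data = trn)
  mean(abs(tst$y - predict(fit, tst)))
}

# Holdout proportion of correct predictions of s from the given variables.
s_acc <- function(vars) {
  fit <- glm(reformulate(vars, response = "s"), data = trn, family = binomial)
  mean((predict(fit, tst, type = "response") > 0.5) == tst$s)
}

# All feature combinations of size 1 up through max_size.
features <- c("x1", "x2", "x3")
max_size <- 2
combos <- unlist(
  lapply(seq_len(max_size), function(k) combn(features, k, simplify = FALSE)),
  recursive = FALSE
)

res <- do.call(rbind, lapply(combos, function(v) {
  data.frame(
    features = paste(v, collapse = ","),
    a = mae(v),            # Y predicted from C
    b = mae(c(v, "s")),    # Y predicted from C plus S
    c = s_acc(v)           # S predicted from C
  )
}))
print(res)
```

Under these assumed measures, a feature set is attractive when 'a' is close to 'b' (little utility lost by omitting S) and 'c' is small (the features reveal little about S).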
Data frame whose first column consists of the variable names, followed by columns 'a', 'b' and 'c' as described in 'details'.
N. Matloff, A. Ashok, S. Martha, A. Mittal
# investigate predictive accuracy for a continuous Y,
# 'wageinc', using the default arguments for maxFeatureSetSize = 4
data(svcensus)
dsldTakeALookAround(svcensus, 'wageinc', 'gender', 4)
# investigate the predictive accuracy for a categorical Y,
# 'educ', using the default arguments for maxFeatureSetSize = 4
dsldTakeALookAround(svcensus, 'educ', 'gender')
Fictional CVs sent to real employers to investigate discrimination via given names. See Mullainathan and Bertrand (2004).
Mullainathan, S. and Bertrand, M. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American Economic Review, 94:991-1013.
The dataset provides applicant information (including race, income, loan
information, etc.). The response variable indicates whether or not the
applicant was approved for the loan. Additional details can be found in
the SortedEffects package.
Via qeML: This data set is adapted from the 2000 Census, restricted to programmers and engineers in the Silicon Valley area.
Attempts to load the specified package, halting execution upon failure.
getSuggestedLib(pkgName)
pkgName |
Name of the package to be checked/loaded. |
No value, just side effects.
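The load-or-halt pattern described above can be sketched in base R as follows. The function name `load_or_stop` is hypothetical, and the actual getSuggestedLib implementation may differ, for example by attaching the package rather than only loading its namespace.

```r
# Load a package's namespace, halting with an error if it is unavailable.
load_or_stop <- function(pkgName) {
  if (!requireNamespace(pkgName, quietly = TRUE)) {
    stop("package '", pkgName, "' is required but could not be loaded",
         call. = FALSE)
  }
  invisible(NULL)   # no value, just side effects
}

# 'stats' ships with R, so this succeeds silently.
load_or_stop("stats")

# A missing package raises an error, caught here for demonstration.
res <- tryCatch(load_or_stop("noSuchPkgXyz123"), error = function(e) e)
print(inherits(res, "error"))
```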