A common problem for researchers who work with data, especially historians, is that a data set has a list of people with names but does not identify the gender of each person. Since first names often indicate gender, it should be possible to predict gender using names. Existing implementations, for example the Natural Language Toolkit implementation based on the Kantrowitz name corpus, sometimes rely on a simple list of names classified as male or female: John is male; Jenny is female; and so on. That approach has two problems. First, some names are ambiguous: is Leslie a male or female name? It would be better to state in precise terms how likely it is that a name is male or female. Second, the gender associated with a name often changes over time, just as the name's popularity does. To illustrate, take the name Madison. That name went from being almost exclusively male to almost exclusively female for children born in the United States after the year 1985.
Predicting gender from names requires a fundamentally historical method. The gender package provides a way to calculate the proportion of male and female names given a year or range of birth years. The predictions are based on calculations from historical data sets. For now these data sets are limited to United States sources, and are drawn from Census Bureau and Social Security Administration data.
The Census data is provided by IPUMS USA from the Minnesota Population Center, University of Minnesota. The IPUMS data includes 1% and 5% samples from the Census returns. The Census, taken decennially, includes respondents' birth dates and gender. With the gender package, it is possible to use this data set for years between 1789 and 1930. The data set includes approximately 339,967 unique names.
The Social Security Administration data was collected from applicants to Social Security. The Social Security Board was created in the New Deal in 1935. Early applicants, however, were people who were nearing retirement age, not people who were being born, so the data set extends further into the past. However, the Social Security Administration did not immediately require all persons born in the United States to register for a Social Security Number. (See Shane Landrum, “The State's Big Family Bible: Birth Certificates, Personal Identity, and Citizenship in the United States, 1840–1950” [PhD dissertation, Brandeis University, 2014].) A consequence of this, for reasons that are not entirely clear, is that for years before 1918 the SSA data set is heavily female; after about 1940 it skews slightly male. For this reason the package corrects the prediction to assume a secondary sex ratio that is evenly distributed between males and females. Also, the SSA data set only includes names that were used more than five times in a given year, so the “long tail” of names is excluded. Even so, the data set includes 91,320 unique names. The SSA data set extends from 1880 to 2012, but for years before 1930 you should use the IPUMS method.
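The exact correction the package applies is not spelled out here. As a rough illustration of the idea only, one way to neutralize an uneven sex ratio is to express each count as a rate within its own sex before computing a name's proportion. The counts, the data frame, and the helper balanced_proportion_female() below are all hypothetical and are not taken from the package.

```r
# Toy counts for a single year (invented numbers), shaped like the
# name/year/female/male tables used by the package.
counts <- data.frame(
  name   = c("leslie", "mary", "john"),
  year   = c(1900, 1900, 1900),
  female = c(20, 700, 10),
  male   = c(30, 5, 260),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the female share of a name after rescaling so that
# recorded males and females carry equal total weight.
balanced_proportion_female <- function(df, target_name) {
  total_female <- sum(df$female)
  total_male   <- sum(df$male)
  row <- df[df$name == target_name, ]
  # Rates within each sex remove the effect of one sex being
  # over-represented in the data set as a whole.
  rate_female <- sum(row$female) / total_female
  rate_male   <- sum(row$male) / total_male
  rate_female / (rate_female + rate_male)
}

raw      <- 20 / (20 + 30)                              # uncorrected share
balanced <- balanced_proportion_female(counts, "leslie")
c(raw = raw, balanced = balanced)
```

Because the toy data is heavily female overall, the balanced female share for "leslie" comes out lower than the raw share of 0.4.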
The simplest way to predict gender from a name is to pass the name to the gender() function. Notice that the capitalization of the name passed to the function does not matter.
gender("Madison")
## $name
## [1] "Madison"
##
## $proportion_male
## [1] 0.0162
##
## $proportion_female
## [1] 0.9838
##
## $gender
## [1] "female"
##
## $year_min
## [1] 1932
##
## $year_max
## [1] 2012
The function returns a list. The name is the name that was encoded. proportion_male and proportion_female show the relative proportions of male and female uses of the name in the given range of years. The values year_min and year_max report the range of years that the function used to predict gender. Finally, the gender value is the prediction itself. The value will be male or female if the corresponding proportion is above 0.5; it will be “either” if the proportions are exactly 0.5; and it will be NA if the gender cannot be predicted with the given method and range of years.
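The decision rule just described can be sketched as a small function. This is only an illustration of the rule as stated in the text; classify_gender() is a hypothetical helper, not the package's own code.

```r
# Hypothetical helper mirroring the rule described above.
classify_gender <- function(proportion_male) {
  if (is.na(proportion_male)) {
    return(NA_character_)   # no prediction possible
  }
  if (proportion_male > 0.5) {
    "male"
  } else if (proportion_male < 0.5) {
    "female"
  } else {
    "either"                # proportions are exactly equal
  }
}

classify_gender(0.0162)  # "female", as for Madison above
classify_gender(0.5)     # "either"
classify_gender(NA)      # NA
```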
In practice, you are better off being explicit about the method and the range of years that you are using. The range of years can be a single value (e.g., 1890) or a range of years in the form c(1890, 1900). Both the years and the method can be specified with arguments to the gender() function. Notice the varying proportions and predictions for different years and methods. You should think carefully about the data from which you wish to predict gender and which data set is most appropriate.
gender("Madison", method = "ipums", years = 1850)
## $name
## [1] "Madison"
##
## $proportion_male
## [1] 1
##
## $proportion_female
## [1] 0
##
## $gender
## [1] "male"
##
## $year_min
## [1] 1850
##
## $year_max
## [1] 1850
gender("Madison", method = "ssa", years = 1950)
## $name
## [1] "Madison"
##
## $proportion_male
## [1] 1
##
## $proportion_female
## [1] 0
##
## $gender
## [1] "male"
##
## $year_min
## [1] 1950
##
## $year_max
## [1] 1950
gender("Madison", method = "ssa", years = 2000)
## $name
## [1] "Madison"
##
## $proportion_male
## [1] 0.0064
##
## $proportion_female
## [1] 0.9936
##
## $gender
## [1] "female"
##
## $year_min
## [1] 2000
##
## $year_max
## [1] 2000
Most often you have a data set and you want to predict gender for multiple names. Consider this sample data set.
sample_names_df
## names years
## 1 john 1930
## 2 john 1960
## 3 john 1990
## 4 john 2010
## 5 jane 1930
## 6 jane 1960
## 7 jane 1990
## 8 jane 2010
## 9 madison 1930
## 10 madison 1960
## 11 madison 1990
## 12 madison 2010
## 13 lindsay 1930
## 14 lindsay 1960
## 15 lindsay 1990
## 16 lindsay 2010
Here we have a data set with first names connected to years. It is important to emphasize that these years should be the years of birth. If your years represent something else, you will first have to estimate the years of birth from them.
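As a minimal sketch of one such estimate, suppose each record gives the year a document was created and the person's stated age. The data frame and its column names (record_year, age) are hypothetical; your own sources will differ.

```r
# Hypothetical records: the year a document was created and the stated age.
records <- data.frame(
  names       = c("john", "jane"),
  record_year = c(1900, 1900),
  age         = c(40, 25),
  stringsAsFactors = FALSE
)

# Approximate birth year; this can be off by one year, depending on
# whether the birthday had already passed when the record was made.
records$birth_year <- records$record_year - records$age

# A range of years allows for that uncertainty, e.g. 1859 to 1860 for john.
records$birth_year_min <- records$birth_year - 1
records$birth_year_max <- records$birth_year
records
```

Since the years argument accepts a range in the form c(min, max), the two bounds can be used in place of a single estimated year.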
If we want to use the same range of years for all of the names, we can pass the names vector to the gender() function and use a constant range of years (in this case, the minimum and maximum year in the data set).
library(magrittr) # to use the %>% pipe operator
gender(sample_names_df$names, method = "ssa", years = c(1930, 2010)) %>%
  head()
## [[1]]
## [[1]]$name
## [1] "john"
##
## [[1]]$proportion_male
## [1] 0.996
##
## [[1]]$proportion_female
## [1] 0.004
##
## [[1]]$gender
## [1] "male"
##
## [[1]]$year_min
## [1] 1930
##
## [[1]]$year_max
## [1] 2010
##
##
## [[2]]
## [[2]]$name
## [1] "john"
##
## [[2]]$proportion_male
## [1] 0.996
##
## [[2]]$proportion_female
## [1] 0.004
##
## [[2]]$gender
## [1] "male"
##
## [[2]]$year_min
## [1] 1930
##
## [[2]]$year_max
## [1] 2010
##
##
## [[3]]
## [[3]]$name
## [1] "john"
##
## [[3]]$proportion_male
## [1] 0.996
##
## [[3]]$proportion_female
## [1] 0.004
##
## [[3]]$gender
## [1] "male"
##
## [[3]]$year_min
## [1] 1930
##
## [[3]]$year_max
## [1] 2010
##
##
## [[4]]
## [[4]]$name
## [1] "john"
##
## [[4]]$proportion_male
## [1] 0.996
##
## [[4]]$proportion_female
## [1] 0.004
##
## [[4]]$gender
## [1] "male"
##
## [[4]]$year_min
## [1] 1930
##
## [[4]]$year_max
## [1] 2010
##
##
## [[5]]
## [[5]]$name
## [1] "jane"
##
## [[5]]$proportion_male
## [1] 0.003
##
## [[5]]$proportion_female
## [1] 0.997
##
## [[5]]$gender
## [1] "female"
##
## [[5]]$year_min
## [1] 1930
##
## [[5]]$year_max
## [1] 2010
##
##
## [[6]]
## [[6]]$name
## [1] "jane"
##
## [[6]]$proportion_male
## [1] 0.003
##
## [[6]]$proportion_female
## [1] 0.997
##
## [[6]]$gender
## [1] "female"
##
## [[6]]$year_min
## [1] 1930
##
## [[6]]$year_max
## [1] 2010
The result is a list of lists. While we could deal with that data structure if we needed to, it is much easier to convert the list of lists to a data frame:
gender(sample_names_df$names,
       method = "ssa",
       years = c(1930, 2010)) %>%
  do.call(rbind.data.frame, .)
## name proportion_male proportion_female gender year_min year_max
## 2 john 0.9960 0.0040 male 1930 2010
## 21 john 0.9960 0.0040 male 1930 2010
## 3 john 0.9960 0.0040 male 1930 2010
## 4 john 0.9960 0.0040 male 1930 2010
## 5 jane 0.0030 0.9970 female 1930 2010
## 6 jane 0.0030 0.9970 female 1930 2010
## 7 jane 0.0030 0.9970 female 1930 2010
## 8 jane 0.0030 0.9970 female 1930 2010
## 9 madison 0.0175 0.9825 female 1930 2010
## 10 madison 0.0175 0.9825 female 1930 2010
## 11 madison 0.0175 0.9825 female 1930 2010
## 12 madison 0.0175 0.9825 female 1930 2010
## 13 lindsay 0.0297 0.9703 female 1930 2010
## 14 lindsay 0.0297 0.9703 female 1930 2010
## 15 lindsay 0.0297 0.9703 female 1930 2010
## 16 lindsay 0.0297 0.9703 female 1930 2010
But in most cases you will want to associate a specific year with a specific name. This can be done using the Map() function.
results <- Map(gender,
               sample_names_df$names,
               years = sample_names_df$years,
               method = "ssa") %>%
  do.call(rbind.data.frame, .)
results
## name proportion_male proportion_female gender year_min
## john john 0.9926 0.0074 male 1930
## john1 john 0.9967 0.0033 male 1960
## john2 john 0.9970 0.0030 male 1990
## john3 john 0.9992 0.0008 male 2010
## jane jane 0.0047 0.9953 female 1930
## jane1 jane 0.0027 0.9973 female 1960
## jane2 jane 0.0095 0.9905 female 1990
## jane3 jane 0.0000 1.0000 female 2010
## madison madison 1.0000 0.0000 male 1930
## madison1 madison 1.0000 0.0000 male 1960
## madison2 madison 0.0870 0.9130 female 1990
## madison3 madison 0.0023 0.9977 female 2010
## lindsay lindsay 1.0000 0.0000 male 1930
## lindsay1 lindsay 0.7274 0.2726 male 1960
## lindsay2 lindsay 0.0073 0.9927 female 1990
## lindsay3 lindsay 0.0000 1.0000 female 2010
## year_max
## john 1930
## john1 1960
## john2 1990
## john3 2010
## jane 1930
## jane1 1960
## jane2 1990
## jane3 2010
## madison 1930
## madison1 1960
## madison2 1990
## madison3 2010
## lindsay 1930
## lindsay1 1960
## lindsay2 1990
## lindsay3 2010
Now you have a separate data frame with the results from the encoding. This can be merged back into the original data frame using a join:
joined <- merge(sample_names_df, results,
                by.x = c("names", "years"), by.y = c("name", "year_min"))
joined
## names years proportion_male proportion_female gender year_max
## 1 jane 1930 0.0047 0.9953 female 1930
## 2 jane 1960 0.0027 0.9973 female 1960
## 3 jane 1990 0.0095 0.9905 female 1990
## 4 jane 2010 0.0000 1.0000 female 2010
## 5 john 1930 0.9926 0.0074 male 1930
## 6 john 1960 0.9967 0.0033 male 1960
## 7 john 1990 0.9970 0.0030 male 1990
## 8 john 2010 0.9992 0.0008 male 2010
## 9 lindsay 1930 1.0000 0.0000 male 1930
## 10 lindsay 1960 0.7274 0.2726 male 1960
## 11 lindsay 1990 0.0073 0.9927 female 1990
## 12 lindsay 2010 0.0000 1.0000 female 2010
## 13 madison 1930 1.0000 0.0000 male 1930
## 14 madison 1960 1.0000 0.0000 male 1960
## 15 madison 1990 0.0870 0.9130 female 1990
## 16 madison 2010 0.0023 0.9977 female 2010
By using the certainty option you can determine whether or not to return the proportion of male and female names. When predicting gender, the gender() function treats any proportion above 0.5 as enough to predict male or female. If you want to be more certain about your prediction, you can use the values in the proportion columns and accept a prediction only above a threshold of your choosing, for instance 0.7.
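A stricter threshold can be applied to the proportion columns after the fact. The sketch below uses a toy data frame built from the values shown earlier for "lindsay", and a cut-off of 0.8 (deliberately stricter than the 0.7 mentioned above, so that one row is affected). The gender_strict column name is hypothetical.

```r
# Toy results using values printed earlier for "lindsay"; the column names
# follow the data frames shown above.
results <- data.frame(
  name              = c("lindsay", "lindsay", "lindsay"),
  proportion_male   = c(1.0000, 0.7274, 0.0073),
  proportion_female = c(0.0000, 0.2726, 0.9927),
  year_min          = c(1930, 1960, 1990),
  stringsAsFactors  = FALSE
)

threshold <- 0.8  # illustrative cut-off; choose one that suits your data

# Keep a prediction only when one proportion reaches the threshold;
# otherwise record NA rather than guessing.
results$gender_strict <- ifelse(
  results$proportion_male >= threshold, "male",
  ifelse(results$proportion_female >= threshold, "female", NA_character_)
)
results$gender_strict
## [1] "male"   NA       "female"
```

The 1960 row, where the male proportion is 0.7274, no longer receives a prediction under the stricter cut-off.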
The data sets which are a part of this package can be viewed by running the following command.
data(package = "gender")
You can then load any of those data sets and work with them directly.
data(ssa_national)
ssa_national
## Source: local data frame [1,603,026 x 4]
##
## name year female male
## 1 aaban 2007 0 5
## 2 aaban 2009 0 6
## 3 aaban 2010 0 9
## 4 aaban 2011 0 11
## 5 aaban 2012 0 11
## 6 aabha 2011 7 0
## 7 aabha 2012 5 0
## 8 aabid 2003 0 5
## 9 aabriella 2008 5 0
## 10 aadam 1987 0 5
## .. ... ... ... ...
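Working with the counts directly might look like the following sketch. To keep it runnable without the package installed, it rebuilds a few rows from the output above into a small data frame called ssa_sample (a hypothetical name), then totals the counts per name within a range of years.

```r
# A few rows copied from the printed ssa_national output above.
ssa_sample <- data.frame(
  name   = c("aaban", "aaban", "aaban", "aabha", "aabha", "aadam"),
  year   = c(2007, 2009, 2010, 2011, 2012, 1987),
  female = c(0, 0, 0, 7, 5, 0),
  male   = c(5, 6, 9, 0, 0, 5),
  stringsAsFactors = FALSE
)

# Restrict to a range of birth years, then total the counts for each name.
in_range <- ssa_sample[ssa_sample$year >= 2007 & ssa_sample$year <= 2012, ]
totals   <- aggregate(cbind(female, male) ~ name, data = in_range, FUN = sum)

# Proportion of each name's uses recorded as female within the range.
totals$proportion_female <- totals$female / (totals$female + totals$male)
totals
```

The same steps should apply to the full ssa_national data frame once it is loaded, though on a table of that size a grouped summary with a data manipulation package may be faster than aggregate().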