Predicting Gender from Names Using Historical Data

A common problem for researchers who work with data, especially historians, is that a data set lists people by name but does not identify their gender. Since first names often indicate gender, it should be possible to predict gender from names. Existing implementations, for example the Natural Language Toolkit implementation based on the Kantrowitz name corpus, sometimes rely on a simple list of names classified as male or female: John is male; Jenny is female; and so on. The problem with that approach is twofold. First, some names are ambiguous: is Leslie a male or female name? It would be better to state in precise terms how likely it is that a name is male or female. Second, the gender associated with a name often changes over time, just as names vary in popularity. To illustrate, take the name Madison. That name went from being almost exclusively male to almost exclusively female for children born in the United States after the year 1985.

Predicting gender from names requires a fundamentally historical method. The gender package provides a way to calculate the proportion of male and female uses of a name given a year or range of birth years. The predictions are based on calculations from historical data sets. For now these data sets are limited to United States sources, and are drawn from Census Bureau and Social Security Administration data.

About the data sets

The Census data is provided by IPUMS USA from the Minnesota Population Center, University of Minnesota. The IPUMS data includes 1% and 5% samples from the Census returns. The Census, taken decennially, includes respondents' birth dates and gender. With the gender package, it is possible to use this data set for years between 1789 and 1930. The data set includes 339,967 unique names.

The Social Security Administration data was collected from applicants to Social Security. The Social Security Board was created in the New Deal in 1935. Early applicants, however, were people nearing retirement age, not newborns, so the data set extends further into the past. At the same time, the Social Security Administration did not immediately require all persons born in the United States to register for a Social Security Number. (See Shane Landrum, “The State's Big Family Bible: Birth Certificates, Personal Identity, and Citizenship in the United States, 1840–1950” [PhD dissertation, Brandeis University, 2014].) A consequence of this, for reasons that are not entirely clear, is that for years before 1918 the SSA data set is heavily female; after about 1940 it skews slightly male. For this reason this package corrects the prediction to assume a secondary sex ratio that is evenly distributed between males and females. Also, the SSA data set only includes names that were used at least five times in a given year, so the “long tail” of names is excluded. Even so, the data set includes 91,320 unique names. The SSA data set extends from 1880 to 2012, but for years before 1930 you should use the IPUMS method.
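
The exact correction the package applies is not spelled out here. As a rough sketch of one way such a correction could work, the example below expresses each count as a share of its own sex's total for the year, so that an over-represented sex does not inflate the proportion. All numbers and variable names are made up for illustration and are not taken from the SSA data or from the package.

```r
# Toy counts for one name in one year (made-up numbers, not SSA data).
name_female <- 30
name_male   <- 10

# Toy totals for everyone recorded in that year; this sample is heavily female.
total_female <- 6000
total_male   <- 2000

# Express each count as a share of its own sex's total, so that both sexes
# carry equal weight regardless of how many of each were recorded.
share_female <- name_female / total_female
share_male   <- name_male / total_male

raw_proportion_female       <- name_female / (name_female + name_male)
corrected_proportion_female <- share_female / (share_female + share_male)

raw_proportion_female        # 0.75
corrected_proportion_female  # 0.5
```

In this toy case the name looks 75% female in the raw counts, but only because three times as many females were recorded overall; after reweighting it is evenly split.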

Predicting gender for single names

The simplest way to predict gender from a name is to pass the name to the gender() function. Notice that the capitalization of the name passed to the function does not matter.

gender("Madison")
## $name
## [1] "Madison"
## 
## $proportion_male
## [1] 0.0162
## 
## $proportion_female
## [1] 0.9838
## 
## $gender
## [1] "female"
## 
## $year_min
## [1] 1932
## 
## $year_max
## [1] 2012

The function returns a list. The name value is the name that was passed to the function. The values proportion_male and proportion_female show the relative proportions of male and female uses of the name in the given range of years. The values year_min and year_max report the range of years that the function used to predict gender. Finally, the gender value is the prediction itself. The value will be “male” or “female” if the corresponding proportion is above 0.5; it will be “either” if the proportion is exactly 0.5; and it will be NA if the gender cannot be predicted with the given method and range of years.
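
The decision rule just described can be written out in a few lines. This is only a sketch of the rule as stated above, not the package's own code; classify_gender() is a hypothetical helper name.

```r
# Hypothetical helper, not part of the gender package: turn a proportion of
# male uses into a prediction following the rule described in the text.
classify_gender <- function(proportion_male) {
  if (is.na(proportion_male)) {
    return(NA_character_)   # no data for this name, method, and years
  }
  if (proportion_male > 0.5) {
    "male"
  } else if (proportion_male < 0.5) {
    "female"
  } else {
    "either"                # exactly 0.5
  }
}

classify_gender(0.0162)  # "female"
classify_gender(1)       # "male"
classify_gender(0.5)     # "either"
classify_gender(NA)      # NA
```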

In practice, you are better off being explicit about the method and the range of years that you are using. The range of years can be a single value (e.g., 1890) or a range of years in the form c(1890, 1900). The years and the method can be specified with arguments to the gender() function. Notice the varying proportions and predictions for different years and methods. You should think carefully about the data from which you wish to predict gender and which data set is most appropriate.

gender("Madison", method = "ipums", years = 1850)
## $name
## [1] "Madison"
## 
## $proportion_male
## [1] 1
## 
## $proportion_female
## [1] 0
## 
## $gender
## [1] "male"
## 
## $year_min
## [1] 1850
## 
## $year_max
## [1] 1850

gender("Madison", method = "ssa", years = 1950)
## $name
## [1] "Madison"
## 
## $proportion_male
## [1] 1
## 
## $proportion_female
## [1] 0
## 
## $gender
## [1] "male"
## 
## $year_min
## [1] 1950
## 
## $year_max
## [1] 1950

gender("Madison", method = "ssa", years = 2000)
## $name
## [1] "Madison"
## 
## $proportion_male
## [1] 0.0064
## 
## $proportion_female
## [1] 0.9936
## 
## $gender
## [1] "female"
## 
## $year_min
## [1] 2000
## 
## $year_max
## [1] 2000

Predicting gender from data frames

Most often you have a data set and you want to predict gender for multiple names. Consider this sample data set.

sample_names_df
##      names years
## 1     john  1930
## 2     john  1960
## 3     john  1990
## 4     john  2010
## 5     jane  1930
## 6     jane  1960
## 7     jane  1990
## 8     jane  2010
## 9  madison  1930
## 10 madison  1960
## 11 madison  1990
## 12 madison  2010
## 13 lindsay  1930
## 14 lindsay  1960
## 15 lindsay  1990
## 16 lindsay  2010

Here we have a data set with first names connected to years. It is important to emphasize that these years should be years of birth. If your years represent something else, you will have to find a way to estimate the years of birth.
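
As one illustration of estimating birth years, suppose the records come from a census that reports each person's age. The data frame and its column names below are hypothetical; the only real step is the subtraction.

```r
# Hypothetical records: a census year and the age reported in that year.
census_df <- data.frame(
  names       = c("john", "jane", "madison"),
  census_year = c(1900, 1900, 1910),
  age         = c(34, 7, 52),
  stringsAsFactors = FALSE
)

# Estimated year of birth. This can be off by a year, depending on whether
# the person's birthday fell before or after the census date.
census_df$years <- census_df$census_year - census_df$age
census_df$years  # 1866 1893 1858
```

Because the estimate can be off by a year, passing a short range of years rather than a single year may be the safer choice.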

If we want to use the same range of years for all of the names, we can pass the names vector to the gender() function and use a constant range of years (in this case, the minimum and maximum year in the data set).

library(magrittr) # to use the %>% pipe operator
gender(sample_names_df$names, method = "ssa", years = c(1930, 2010)) %>%
  head()
## [[1]]
## [[1]]$name
## [1] "john"
## 
## [[1]]$proportion_male
## [1] 0.996
## 
## [[1]]$proportion_female
## [1] 0.004
## 
## [[1]]$gender
## [1] "male"
## 
## [[1]]$year_min
## [1] 1930
## 
## [[1]]$year_max
## [1] 2010
## 
## 
## [[2]]
## [[2]]$name
## [1] "john"
## 
## [[2]]$proportion_male
## [1] 0.996
## 
## [[2]]$proportion_female
## [1] 0.004
## 
## [[2]]$gender
## [1] "male"
## 
## [[2]]$year_min
## [1] 1930
## 
## [[2]]$year_max
## [1] 2010
## 
## 
## [[3]]
## [[3]]$name
## [1] "john"
## 
## [[3]]$proportion_male
## [1] 0.996
## 
## [[3]]$proportion_female
## [1] 0.004
## 
## [[3]]$gender
## [1] "male"
## 
## [[3]]$year_min
## [1] 1930
## 
## [[3]]$year_max
## [1] 2010
## 
## 
## [[4]]
## [[4]]$name
## [1] "john"
## 
## [[4]]$proportion_male
## [1] 0.996
## 
## [[4]]$proportion_female
## [1] 0.004
## 
## [[4]]$gender
## [1] "male"
## 
## [[4]]$year_min
## [1] 1930
## 
## [[4]]$year_max
## [1] 2010
## 
## 
## [[5]]
## [[5]]$name
## [1] "jane"
## 
## [[5]]$proportion_male
## [1] 0.003
## 
## [[5]]$proportion_female
## [1] 0.997
## 
## [[5]]$gender
## [1] "female"
## 
## [[5]]$year_min
## [1] 1930
## 
## [[5]]$year_max
## [1] 2010
## 
## 
## [[6]]
## [[6]]$name
## [1] "jane"
## 
## [[6]]$proportion_male
## [1] 0.003
## 
## [[6]]$proportion_female
## [1] 0.997
## 
## [[6]]$gender
## [1] "female"
## 
## [[6]]$year_min
## [1] 1930
## 
## [[6]]$year_max
## [1] 2010

The result is a list of lists. While we could deal with that data structure if we needed to, it is much easier to convert the list of lists to a data frame:

gender(sample_names_df$names,
       method = "ssa",
       years = c(1930, 2010)) %>%
  do.call(rbind.data.frame, .)
##       name proportion_male proportion_female gender year_min year_max
## 2     john          0.9960            0.0040   male     1930     2010
## 21    john          0.9960            0.0040   male     1930     2010
## 3     john          0.9960            0.0040   male     1930     2010
## 4     john          0.9960            0.0040   male     1930     2010
## 5     jane          0.0030            0.9970 female     1930     2010
## 6     jane          0.0030            0.9970 female     1930     2010
## 7     jane          0.0030            0.9970 female     1930     2010
## 8     jane          0.0030            0.9970 female     1930     2010
## 9  madison          0.0175            0.9825 female     1930     2010
## 10 madison          0.0175            0.9825 female     1930     2010
## 11 madison          0.0175            0.9825 female     1930     2010
## 12 madison          0.0175            0.9825 female     1930     2010
## 13 lindsay          0.0297            0.9703 female     1930     2010
## 14 lindsay          0.0297            0.9703 female     1930     2010
## 15 lindsay          0.0297            0.9703 female     1930     2010
## 16 lindsay          0.0297            0.9703 female     1930     2010

But in most cases you will want to associate a specific year with a specific name. This can be done using the Map() function.

results <- Map(gender,
               sample_names_df$names,
               years = sample_names_df$years,
               method = "ssa") %>%
  do.call(rbind.data.frame, .)
results
##             name proportion_male proportion_female gender year_min
## john        john          0.9926            0.0074   male     1930
## john1       john          0.9967            0.0033   male     1960
## john2       john          0.9970            0.0030   male     1990
## john3       john          0.9992            0.0008   male     2010
## jane        jane          0.0047            0.9953 female     1930
## jane1       jane          0.0027            0.9973 female     1960
## jane2       jane          0.0095            0.9905 female     1990
## jane3       jane          0.0000            1.0000 female     2010
## madison  madison          1.0000            0.0000   male     1930
## madison1 madison          1.0000            0.0000   male     1960
## madison2 madison          0.0870            0.9130 female     1990
## madison3 madison          0.0023            0.9977 female     2010
## lindsay  lindsay          1.0000            0.0000   male     1930
## lindsay1 lindsay          0.7274            0.2726   male     1960
## lindsay2 lindsay          0.0073            0.9927 female     1990
## lindsay3 lindsay          0.0000            1.0000 female     2010
##          year_max
## john         1930
## john1        1960
## john2        1990
## john3        2010
## jane         1930
## jane1        1960
## jane2        1990
## jane3        2010
## madison      1930
## madison1     1960
## madison2     1990
## madison3     2010
## lindsay      1930
## lindsay1     1960
## lindsay2     1990
## lindsay3     2010

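Now you have a separate data frame with the results of the predictions. Because each prediction used a single year, year_min is the same as the birth year, so the results can be merged back into the original data frame with a join on name and year: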

joined <- merge(sample_names_df, results, 
                by.x = c("names", "years"), by.y = c("name", "year_min"))
joined
##      names years proportion_male proportion_female gender year_max
## 1     jane  1930          0.0047            0.9953 female     1930
## 2     jane  1960          0.0027            0.9973 female     1960
## 3     jane  1990          0.0095            0.9905 female     1990
## 4     jane  2010          0.0000            1.0000 female     2010
## 5     john  1930          0.9926            0.0074   male     1930
## 6     john  1960          0.9967            0.0033   male     1960
## 7     john  1990          0.9970            0.0030   male     1990
## 8     john  2010          0.9992            0.0008   male     2010
## 9  lindsay  1930          1.0000            0.0000   male     1930
## 10 lindsay  1960          0.7274            0.2726   male     1960
## 11 lindsay  1990          0.0073            0.9927 female     1990
## 12 lindsay  2010          0.0000            1.0000 female     2010
## 13 madison  1930          1.0000            0.0000   male     1930
## 14 madison  1960          1.0000            0.0000   male     1960
## 15 madison  1990          0.0870            0.9130 female     1990
## 16 madison  2010          0.0023            0.9977 female     2010

Predicting gender for yourself

By using the certainty option you can choose whether or not the function returns the proportions of male and female uses of a name. When predicting gender, the gender() function treats any proportion above 0.5 as enough to call a name male or female. If you want to be more certain about your prediction, you can use the values in the proportion columns and accept a prediction only above a threshold of, for instance, 0.7.
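
A minimal sketch of applying your own threshold follows. The rows are copied from the results shown earlier; apply_threshold() and the certain_70 and certain_80 columns are hypothetical names used only for illustration.

```r
# A few rows copied from the results data frame shown earlier.
results <- data.frame(
  name            = c("john", "lindsay", "lindsay", "madison"),
  year_min        = c(1930, 1960, 1990, 1990),
  proportion_male = c(0.9926, 0.7274, 0.0073, 0.0870),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep a prediction only when one of the two proportions
# reaches the threshold; otherwise return NA (gender left undecided).
apply_threshold <- function(proportion_male, threshold = 0.7) {
  proportion_female <- 1 - proportion_male
  ifelse(proportion_male >= threshold, "male",
         ifelse(proportion_female >= threshold, "female", NA_character_))
}

results$certain_70 <- apply_threshold(results$proportion_male, 0.7)
results$certain_80 <- apply_threshold(results$proportion_male, 0.8)
results
```

At a threshold of 0.7 every row still receives a prediction, but at 0.8 the 1960 entry for lindsay (0.7274 male) becomes NA.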

Accessing the data sets

The data sets that are part of this package can be listed by running the following command.

data(package = "gender")

You can then load any of those data sets and work with them directly.

data(ssa_national)
ssa_national
## Source: local data frame [1,603,026 x 4]
## 
##         name year female male
## 1      aaban 2007      0    5
## 2      aaban 2009      0    6
## 3      aaban 2010      0    9
## 4      aaban 2011      0   11
## 5      aaban 2012      0   11
## 6      aabha 2011      7    0
## 7      aabha 2012      5    0
## 8      aabid 2003      0    5
## 9  aabriella 2008      5    0
## 10     aadam 1987      0    5
## ..       ...  ...    ...  ...
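
Working with the counts directly lets you compute your own proportions. The sketch below uses a small made-up data frame with the same columns as ssa_national so that it runs without the package; proportions_for() is a hypothetical helper, and it pools raw counts without any correction for the sex ratio.

```r
# Toy stand-in with the same columns as ssa_national; the counts are made up.
toy_counts <- data.frame(
  name   = c("madison", "madison", "madison", "jane"),
  year   = c(1985, 1990, 2000, 1990),
  female = c(300, 1400, 19000, 600),
  male   = c(50, 130, 140, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pool the counts for one name over a year or range of
# years and return the share of male and female uses.
proportions_for <- function(counts, name, years) {
  rows <- counts$name == tolower(name) &
    counts$year >= min(years) & counts$year <= max(years)
  female <- sum(counts$female[rows])
  male   <- sum(counts$male[rows])
  total  <- female + male
  if (total == 0) {
    # The name does not appear in the requested years.
    return(c(proportion_male = NA_real_, proportion_female = NA_real_))
  }
  c(proportion_male = male / total, proportion_female = female / total)
}

proportions_for(toy_counts, "Madison", c(1985, 1990))
```

To try this on the real data, replace toy_counts with ssa_national after loading it.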