Jamie's Blog

Lessons from a life of startups, coding, countryside, and kids

Analysing a List of Names for Gender

2015 05 26 at 22.51

WebSummit decided that sending the names of 3000 attendees to people who still hadn’t availed of their wonderful offer would be a great idea. Personally, I found it offensive to receive the email, and my name wasn’t even on the list. I wonder how the people whose names were on it felt. In fairness, it’s apparently made clear to attendees that their names will be used in this way, but it’s pretty icky-feeling marketing.

Anyway, rant aside.

What’s a boy to do with 3000 names?

Now I have a list of 3000 names. What can I do with them?

Since I was minding my son that afternoon, I did something quick so I could get back to playing football.

I used the Gender API service to analyse the first 500 names (because I’m also too cheap to buy credits for the full 3000):

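require 'json'
require 'rest-client'

MAX_NAMES = 500

raw = []
results = Hash.new(0)
errors = 0
num_names = 0

File.foreach('names.txt') do |line|
  # Each line is a full name; only the first name is sent to the API.
  first_name = line.split.first
  next if first_name.nil? # skip blank lines

  begin
    params = { key: 'XXXXXXXXXXXXXXXXXX', name: first_name }
    response = RestClient.get('https://gender-api.com/get', params: params)
  rescue RestClient::ExceptionWithResponse
    # rest-client raises on 4xx/5xx responses instead of returning them.
    errors += 1
    next
  end

  # Keep the raw body so the results can be re-analysed without new API calls.
  raw << response.body
  results[JSON.parse(response.body)['gender']] += 1
  num_names += 1

  break if num_names >= MAX_NAMES
end

results.each do |gender, count|
  percent = count * 100.0 / num_names
  # Prints e.g. "male: 384 (76.65)%".
  puts format('%s: %d (%.2f)%%', gender, count, percent)
end
puts "#{errors} Errors"

# One JSON response per line.
File.open('raw.json', 'w') do |f|
  f.puts raw
end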

This script reads in a file of names, one per line. We split each full name and pass the first name to the Gender API. We store the raw JSON responses in a file (a local cache in case we want to reuse them) and count the resulting genders. Then we print the results. Not rocket science, but a little fun.
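Since the raw responses are cached, the counts can be rebuilt later without spending more API credits. Here is a minimal sketch of that, assuming raw.json holds one single-line JSON response per line and that each response has a 'gender' key, as the script relies on. The helper name tally_genders is mine, not part of any library.

```ruby
require 'json'

# Count the 'gender' field across cached responses (one JSON document per line).
def tally_genders(lines)
  counts = Hash.new(0)
  lines.each do |line|
    line = line.strip
    next if line.empty? # ignore blank lines
    counts[JSON.parse(line)['gender']] += 1
  end
  counts
end

# Only runs if the cache file from the script above is present.
if File.exist?('raw.json')
  tally_genders(File.foreach('raw.json')).each do |gender, count|
    puts "#{gender}: #{count}"
  end
end
```

If a response ever spanned several lines, this line-by-line approach would break and the cache would need a different format.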

Results

The output from the first 500 names was (the counts actually sum to 501, so the run stopped one name late):

websummit_names: ruby count.rb
unknown: 6 (1.20)%
male: 384 (76.65)%
female: 111 (22.16)%
0 Errors

In fairness, a 76/22 male/female split is actually pretty good in the world of tech conferences. It’s still too unbalanced but not as bad as I feared.
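As a quick sanity check, the percentages can be reproduced from the counts alone. This is just a sketch of the arithmetic; the helper name percentages is made up for illustration. The denominator is the sum of the three counts in the output, which is 501.

```ruby
# Convert raw counts into percentages of the total, rounded to two decimals.
def percentages(counts)
  total = counts.values.sum
  counts.transform_values { |count| (count * 100.0 / total).round(2) }
end

counts = { 'unknown' => 6, 'male' => 384, 'female' => 111 } # from the output above
percentages(counts).each do |gender, percent|
  puts format('%s: %.2f%%', gender, percent)
end
```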

Next steps

I’m probably not going to do anything more with the names, but here are a few other ideas:

  • Use the free FullContact Name API: it has some more information, including gender, median age, etc.
  • Use the Facebook Graph API or maybe LinkedIn to find more information about people with these names. There will be many people with the same name, but it should be possible to disambiguate them, perhaps by finding the person with the shortest connections to other people in the list. Birds of a feather flock together! (Assuming this is technically possible.)
  • LinkedIn could be an interesting source, particularly of company data. If you can get company data, you can guess an email address. And if you’ve got someone’s email address, you can look up all sorts of information.
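
The disambiguation idea in the second bullet could look roughly like the sketch below. It is a simplification: rather than computing shortest paths, it picks the candidate profile with the most direct connections to other names on the list. The data shape (hashes with :id and :connections) and the name best_candidate are entirely hypothetical; no real Facebook or LinkedIn API call is shown or implied.

```ruby
require 'set'

# Pick the candidate whose connections overlap most with the attendee list.
# candidates: array of { id: ..., connections: [names] } (hypothetical shape).
# Returns nil when there are no candidates.
def best_candidate(candidates, attendee_names)
  attendees = attendee_names.to_set
  candidates.max_by { |c| (c[:connections].to_set & attendees).size }
end

attendees = ['Ann Example', 'Bob Example', 'Cy Example']
candidates = [
  { id: 'profile-1', connections: ['Someone Else'] },
  { id: 'profile-2', connections: ['Ann Example', 'Bob Example'] }
]
puts best_candidate(candidates, attendees)[:id]
```

Ties go to whichever candidate comes first, so a real version would want a fallback for ambiguous cases.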