David Gevorkyan, a Principal Software Engineer at eHarmony, recently gave a talk discussing “how Hadoop helps [eHarmony] to process over a billion possible matches into several highly compatible matches for each of our users per day.” Sounds pretty technical, right?
I watched the whole talk (53 minutes!) and I’ve pulled out some pieces for the non-techies out there. There were a lot of interesting tidbits about how eHarmony works. You can see the talk, and the slides, on eHarmony’s engineering blog.
First off, I’m very pleased eHarmony put something out there that gives us a little bit more knowledge about how they work. Transparency is a beautiful thing. Also, thanks so much to David, who was kind enough to answer some of my questions about eHarmony and his talk.
Now, on to the good stuff!
Dr. Neil Clark Warren, founder of eHarmony, came up with a way to systematically match people, using “29 dimensions of compatibility”. The exact 29 dimensions are not disclosed, but they include such things as humor, spirituality, sociability, and ambition.
Over 600,000 marriages have come from people meeting via eHarmony, or about 438 marriages per day (this accounts for about 5% of all new US marriages). eHarmony currently has about 50 million registered users.
David mentioned a study conducted by Harris Interactive for eHarmony that analyzed divorce rates: for the 7-year period eHarmony has been operating, the divorce rate was about 4.8%. (Statistics about current national divorce rates vary, but some recent research puts it at about a 40-50% chance during one’s lifetime, so that’s looking at marriages much longer than 7 years.)
David says that what differentiates eHarmony from other matchmaking sites like Match.com and OkCupid is eHarmony’s “compatibility matching system,” which has three parts:
- Compatibility matching: compatibility based on the personality and psychological profiles
- Affinity matching: machine learning models trained on historical data from the last 15 years predict things such as the probability of communication between users
- Match distribution: ensuring we deliver the right matches at the right time to as many people as possible throughout the entire network
Step 1: Compatibility Matching
When you join eHarmony, you provide criteria such as preferences on distance, income, age range, religion, smoking and drinking preferences, and others. After that, you fill in a comprehensive relationship questionnaire (150 questions!), which is targeted to extract personality and psychological profiles. These questions provide eHarmony with information about personality, values, attributes, and beliefs. eHarmony then uses the “29 dimensions of compatibility” to make the matches.
Based on a marital satisfaction survey of 5000 users, eHarmony identified the most highly satisfied couples and now uses their compatibility scores to predict new matches.
When a new user joins, eHarmony runs that user and a potential match through “complex mathematical equations” that produce a score. If the score is above the threshold for the highly satisfied couples from the survey, the two are considered compatible.
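For the curious, here is a minimal sketch of what "score above a threshold" could look like. eHarmony's actual equations and dimensions are not public, so everything here is an assumption made up for illustration: the dimension names, the 0-to-1 scale, the scoring formula (1 minus the average gap between two people's answers), and the threshold value.

```python
from statistics import mean

# Hypothetical threshold standing in for the score level seen among the
# highly satisfied couples in the survey. The real value is not public.
SATISFIED_COUPLE_THRESHOLD = 0.75


def compatibility_score(a, b):
    """Toy score in [0, 1]: 1 minus the mean absolute gap on shared dimensions.

    This is an invented formula, not eHarmony's.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return 1.0 - mean(abs(a[d] - b[d]) for d in shared)


def is_compatible(a, b, threshold=SATISFIED_COUPLE_THRESHOLD):
    """A pair counts as compatible when its score reaches the threshold."""
    return compatibility_score(a, b) >= threshold


# Made-up profiles, each dimension rated from 0 to 1.
alice = {"humor": 0.9, "spirituality": 0.4, "sociability": 0.7, "ambition": 0.8}
bob = {"humor": 0.8, "spirituality": 0.5, "sociability": 0.6, "ambition": 0.9}
carol = {"humor": 0.1, "spirituality": 1.0, "sociability": 0.1, "ambition": 0.2}

print(is_compatible(alice, bob))    # small gaps on every dimension
print(is_compatible(alice, carol))  # large gaps on every dimension
```

In this toy version, Alice and Bob clear the threshold and Alice and Carol do not. The real system presumably weighs its 29 dimensions in far more sophisticated ways.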
David shared with me the link to one of eHarmony’s matching patents.
On a technical note, eHarmony uses a data storage system called Voldemort (developed by LinkedIn) to store its one-billion+ potential matches per day.
Step 2: Affinity Matching
Based on 15 years of historical data, the system predicts the probability of communication between two users (among other things). David says, “Even though the users are compatible with each other, you might not always decide to give that user as a match.”
And why not? Well, it may be that the user has specified he/she will only communicate with someone within a certain distance, or a certain age range. So the system won’t try to match these people. David told me there is some flexibility with this, but if you have listed something as “very important” then eHarmony won’t give you a match that doesn’t meet your criteria.
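The "very important" rule described above can be sketched as a simple filter. This is only an illustration under my own assumptions: the `Preference` class, the field names, and the idea that each preference is a numeric range are all hypothetical, not details from the talk.

```python
from dataclasses import dataclass


@dataclass
class Preference:
    """A hypothetical numeric preference, e.g. an age range or a maximum distance."""
    name: str
    low: float
    high: float
    very_important: bool = False


def passes_hard_filters(preferences, candidate):
    """Return False only when a 'very important' preference is violated.

    Preferences not marked very important are treated as flexible here,
    so missing them does not rule a candidate out.
    """
    for pref in preferences:
        value = candidate[pref.name]
        in_range = pref.low <= value <= pref.high
        if pref.very_important and not in_range:
            return False
    return True


# Made-up example: age is a dealbreaker, distance is flexible.
prefs = [
    Preference("age", 30, 40, very_important=True),
    Preference("distance_miles", 0, 50),
]
nearby_in_range = {"age": 35, "distance_miles": 20}
far_but_right_age = {"age": 35, "distance_miles": 300}
outside_age_range = {"age": 55, "distance_miles": 10}

print(passes_hard_filters(prefs, nearby_in_range))
print(passes_hard_filters(prefs, far_but_right_age))
print(passes_hard_filters(prefs, outside_age_range))
```

Only the third candidate is ruled out, because age was marked very important. How much flexibility the real system allows on the other criteria was not spelled out.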
He showed an interesting slide on how distance in miles affects the probability of communication: most communication happens, not surprisingly, when users are nearer to each other. However, at some point (over about 1000 miles) it doesn’t really matter any more–I guess long distance is long distance!
David says most communication happens when the man is taller by 4 to 8 inches, and that men are more eager to talk to taller women than women are to talk to shorter men.
Different words you use to describe yourself in your profile affect the probability of communication–that is, how likely you are to get a message from someone else.
For men, these words are likely to get more messages: “perceptive, physically fit, passionate, intelligent, funny, optimistic”. And for women: “sweet, funny, ambitious, thoughtful, passionate”.
Each user has an average of about 1000 attributes, and altogether the users have answered about 4 billion questions. eHarmony makes tens of millions of potential daily matches. Now that’s a lot of data!
Technical note: originally they were using Amazon Web Services, but one issue was that they could not predict when processing jobs (such as predicting matches) would finish. Why does it matter? They want to deliver potential matches “first thing in the morning”.
Step 3: Match Distribution
eHarmony wants to make as many people on the system happy as possible, so it tries to maximize communication between users. It uses machine learning to determine how many matches to send per day, at what time of day, etc.
Finally, someone in the audience asked why certain people are rejected by eHarmony. David said they do have machine learning algorithms in place that are a part of that, but did not give details.