Staffing a Data Science Team

In my previous post, I outlined the importance of data science in leveraging and monetizing the current proliferation of data in most digital industries. The data science group is becoming the de facto group responsible for gaining insight from a company’s data. The pressing question for most organizations without an existing data science group is how to begin building such a team with the proper skillsets. Staffing any group is a challenge, but the complexity and mathematical rigor of data science make it even more difficult to properly gauge a prospective candidate’s qualifications. What do you do when you are just starting and there is no one at the company with any real data science experience? Hiring solely on resumes and educational pedigree in the current cutthroat job market produces an extremely high false positive rate: many candidates who look qualified on paper are not. This is a challenge when building any new type of organization, but the proverbial chicken-or-egg conundrum is worse in data science because the specialized skills it requires are less common in other groups.

I have always been a big proponent of “top down” hiring, in data science and in general. As soon as there is a need for a single data scientist, a company should seek a high-level and very experienced hire who can both manage people efficiently and actually perform the data science job function. This should be a Principal or senior-level employee who already has domain expertise and can independently assess the approach, data, algorithms and technology needed to solve a high-level business need. This person will have already proven themselves, having successfully led a high-performing data science team, or acted as its data science architect, at a company where data is a priority. Vetting this level of employee in any field should be easier because they will have established a long-term track record and are much more likely to have common or well-known references. Compare this to the availability of information on entry or mid-level data scientists, where there are only a few years of job history and relatively sparse referral contacts.

A lot of companies feel this approach is overkill, especially when they aren’t sure that they need more than one or two data scientists at most. The salary difference between the two experience levels is large enough to turn a lot of companies away from this top-down approach, but the alternative tends to carry a much higher risk in the long run, with significantly lower odds of building a functional team. Consider the company that tries to hire someone “cheaper”: who will vet candidates technically during interviews? Without a technical expert, there is no way to ascertain whether a candidate’s expertise matches their resume. I have come across some amazing data science resumes (e.g. a Ph.D. from a great school, with a great list of skills and experience, etc.), only to find a significantly lower level of knowledge when I probed the technical details of the candidate’s work and how the algorithms they employed actually worked. The best people for the job aren’t always the best at marketing their own skills, and hiring managers are generally interested in top performers, not top marketers (this may change if you are actually hiring for marketing). It is important to have as much information as possible, and assessing a candidate’s working knowledge is the number one priority during the hiring process, because mishires cost an organization tremendously in productivity and time.

Even experienced data scientists will agree that interviewing candidates is a challenge. A good data scientist needs to be proficient at both programming and math. There is usually no problem finding experienced programmers to vet the programming side; the trickier issue is assessing data science skills. To this end, I will typically ask candidates to write down a very standard derivation, key to statistics, that I am sure they must have studied. If they are successful, I ask them to modify it in a way they have probably never seen, which tells me whether they actually understand each step they wrote. For example, I could ask the interviewee to write down the math steps involved in the well-known ordinary least squares regression method (any other well-studied statistical method will suffice). This method minimizes the sum of the squared error terms in your data to give you a regression line. Anyone who fits data should know this. For those who answer correctly, I will ask how the derivation changes if they want to do something like minimize the cube of the errors. This kind of thing is never actually done and makes little sense from a practical perspective. However, it allows me to find candidates who actually understood every step they wrote down and can modify the derivation to serve their needs.

A data scientist MUST understand the algorithms they are running. This allows them to understand how best to clean and process data, as well as troubleshoot any production and performance issues. I once asked an interviewee from a well-known data science master’s program how the algorithm he used for his project worked. His response was “I don’t know. I didn’t bother with the details. Why do I need to worry about things like that when the computer handles it for me?” I replied with something like “if all we needed was a coding monkey to pipe data into a command line, then we would have been able to fill the job a long time ago.” As these words left my mouth, I envisioned an army of monkeys frantically typing over thousands of rows of keyboards and computers. As in the well-known Shakespeare analogy, one of these monkeys would have had a good chance of accidentally producing the right result. Sadly, even though they are extremely cheap labor, the housing costs would have been prohibitively high and the “smell” probably would have affected morale among human employees. After my comments, the interviewee realized that his answer was not the best.
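
For readers who want to see the interview exercise described earlier written out, here is a rough sketch of what a candidate might put on the whiteboard. The notation (a line y = a + b x fitted to points (x_i, y_i), with residuals e_i) is my own assumption for illustration, and this is only one reasonable reading of the "cube of errors" follow-up, not the single expected answer.

```latex
% Assumed notation for this sketch: data points (x_i, y_i), i = 1, ..., n,
% fitted line y = a + b x, residual e_i = y_i - a - b x_i.

% Ordinary least squares: minimize the sum of squared residuals.
\[
S(a, b) = \sum_{i=1}^{n} e_i^{2} = \sum_{i=1}^{n} (y_i - a - b x_i)^{2}
\]

% First-order conditions: set both partial derivatives to zero.
\[
\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} e_i = 0,
\qquad
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i e_i = 0
\]

% These are linear in a and b, so they solve in closed form.
\[
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^{2}},
\qquad
a = \bar{y} - b \bar{x}
\]

% Follow-up: replace the square with a cube.
\[
C(a, b) = \sum_{i=1}^{n} e_i^{3},
\qquad
\frac{\partial C}{\partial a} = -3 \sum_{i=1}^{n} e_i^{2} = 0,
\qquad
\frac{\partial C}{\partial b} = -3 \sum_{i=1}^{n} x_i e_i^{2} = 0
\]

% Variant with absolute values, which does have a minimum.
\[
\tilde{C}(a, b) = \sum_{i=1}^{n} |e_i|^{3},
\qquad
\frac{\partial \tilde{C}}{\partial a} = -3 \sum_{i=1}^{n} e_i |e_i| = 0,
\qquad
\frac{\partial \tilde{C}}{\partial b} = -3 \sum_{i=1}^{n} x_i e_i |e_i| = 0
\]
```

A candidate who understands each step should notice a few things. With the signed cube, the first condition can only hold if every residual is zero, so unless the data lie exactly on a line there is no stationary point at all; in fact C can be made arbitrarily negative by letting a grow, because positive and negative errors no longer count equally. Taking the absolute value restores a well-posed problem, but the conditions are no longer linear in a and b, so the closed-form solution disappears and the fit would generally have to be found numerically.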

Aside from being knowledgeable, there is one critical characteristic that any data scientist must have to succeed: the ability to learn new skills and techniques at an almost breakneck pace. The technology stack and programming languages involved in data science have evolved so rapidly in the last decade that it is no longer effective to simply learn one language well and stick with it for an entire career. In the last ten years, we’ve seen the rise of Hadoop, the overwhelming acceptance and community support behind R, the maturing data science toolkits in Python, and an entire slew of new distributed systems meant to process streaming or batch big data (e.g. AWS, Spark, Storm… the list goes on). Any effective data scientist needs to be able to quickly assess and master new technologies. This requires a decent background in the fundamentals of computer science as well as a natural knack for programming. In addition, there are so many types of statistical and machine learning techniques that it is much more important to have someone who can quickly come up to speed in any technique that may be suitable than to hire a candidate who has expertise in only one technique and may not be able to adapt as quickly. This field is ideal for someone with a zeal for constant learning. In the course of my 10-year career, I have consumed at least fifty math and programming books, plus hundreds of peer-reviewed papers. At some point, given enough background, understanding new technologies and algorithms becomes second nature, but this skill is almost a requirement now given the massive proliferation of new technologies every year.

What I’ve discussed above are fundamental problems in hiring and starting a new group that I think are especially relevant to data science organizations. Every type of organization shares these issues, but experience has shown me that an extremely fast-moving field and specialized skillsets compound them, making the data science team harder to staff than others. However, the rewards of a successful team are vast. In my next post, I will go over how we leverage one new technology in particular, the graph database, to better serve customers in fraud analytics and help them find connections in their data.