Toward Better Data Science: Mostly People, But Also Process and Technology – Forbes
[Figure: Finding from Domino Data Lab survey. Credit: Domino Data Lab]
I recently moderated a webinar roundtable on behalf of Domino Data Lab called “Unleash Data Science for the Model-Driven Business You Expect.” I don’t know that everyone expects a model-driven business, but some people clearly do, and many would benefit from it. The goal of the panel was to illuminate just what is involved in achieving a model-driven business.
We had some great panelists, but unfortunately one of them, Irina Malkova, who heads internal data science at Salesforce, had to drop off for a minor medical emergency. We talked before the session, however, so I will mention some of her comments. John Thompson, an old friend and successful author who heads data science for the large biotech firm CSL Behring (they make good things out of blood plasma), was on the panel. Matt Aslett, who at the time of the webinar headed data, analytics, and AI research for 451 Research, part of S&P Global Market Intelligence, came in from the UK. And we also had a prominent representative of our sponsor: Nick Elprin, the CEO and Co-Founder of Domino.
For as long as I have been working in the area of technological change in business, the “people, process, and technology” troika has been a useful way to categorize the key elements of change. So we structured the webinar along those dimensions. The panelists all agreed that the human dimension was the most challenging, so we discussed that first.
Data Science Talent and Skills
Irina Malkova of Salesforce had mentioned before the panel that successful data science requires a variety of tasks, from framing business problems to be solved by AI, to collecting data, to developing algorithms, to deploying and maintaining models. As a result, Malkova commented, data science is hardly a one-person show. A variety of skills are necessary, leading to a variety of data scientist job types (or whatever an organization wants to call them). Elprin suggested that some skills could be made core to the data scientist role, and others could be expected in other types of roles. A recent survey sponsored by Domino suggests that the lack of data science skills is the greatest impediment companies face.
Thompson mentioned that his company's teams typically include data engineers, data scientists, a user interface and visual analytics person, and business subject matter experts. I mentioned that in order to ensure such collaboration, one large healthcare provider had recently combined its AI, analytics, digital, and IT organizations, but Thompson said he thought that was a step backward. Elprin agreed with Thompson, and said that it was most important for data science teams to be close to the business and to serve the business's objectives rather than those of IT. Aslett didn’t take a position on whether these groups need to be combined, but he did emphasize that they need to work closely together.
Data Science Processes
One issue at the intersection of people and process that I asked the panelists about involved the primary objective for using modern data science platforms like Domino’s. Is it to empower professionals to achieve greater productivity and performance, or to enable data science amateurs to produce models through automated machine learning? The former was the strong focus among all the panelists. Thompson said that data science professionals are his primary focus at CSL Behring, Aslett said he sees that as the primary focus in his research, and Elprin said that most customers are focused on professional data scientists as well, at least for non-commodity problems. Perhaps the “citizen data scientist” movement has yet to take full root.
When asked about data science processes, John Thompson said that he divides data science into macro and micro processes. The micro processes are those that data scientists use to collect and refine data and craft models, and as long as they are done well no one pays much attention. More difficult, he said, are the macro processes for getting data science models into production deployment. They involve complex relationships between business stakeholders, data scientists, and technology providers, and require careful change management.
Nick Elprin said that an up-front feasibility assessment and kickoff with the business will provide clarity on what the business change objectives are, how decisions will be affected, and what process changes will be necessary. Thompson at CSL said his organization creates a “project charter” to get clarity on the business objectives and needed changes in data science projects.
Matt Aslett emphasized the importance of clear business KPIs within these macro processes in order to know whether value has been achieved. He said that in some recent research by his firm, two-thirds of companies doing data science at scale said they tracked the ROI of their projects, and the percentage went up to 97% for companies with 250 or more models in production.
Technology for Data Science Success
In the technology component of the roundtable, Nick Elprin from Domino appropriately led off the discussion. He said that data science platforms can now support a broad range of tasks in the data science process, although it’s more a matter of enabling people to perform the process than providing a silver bullet to address data science problems. He said that the data science technology ecosystem is evolving rapidly, mentioning the rapid rise and now decline of Hadoop as an example. He predicted further rapid progress in tools over the next three to five years. Given the rapid progress and change, he also commented that one of the primary objectives of companies he meets with is to avoid being tied into a particular tool or vendor. Instead, Elprin said, “They want to give their data scientists agility and flexibility to use whatever the right tool is for the job that they’re trying to accomplish.”
I asked Elprin if companies were trying to avoid getting locked into particular cloud vendors’ data science offerings. He said that many firms wanted to have hybrid infrastructure strategies and maintain flexibility that way. Over time, he said, open source offerings would provide the strongest capabilities.
Thompson agreed with Elprin; he said that his data scientists prefer development in Python over packaged software or cloud offerings. That, he said, enables them to “build the models that are most effective and precise and have precision in what they predict and prescribe.” The implication, however, is that they also have to build a user-friendly front end for models that are deployed into production.
Nick Elprin also brought up the need for collaboration capabilities for effective data science. He said, “Technology plays a big role in accelerating data science teams, by providing collaboration, knowledge reuse, and sharing in the same way that it’s needed for software engineering teams. How do we find and reuse each other’s work? How do we find past work that we’ve done? It’s particularly important for geographically distributed teams.” Aslett and Thompson both agreed on the importance of these collaboration capabilities.
The final topic discussed in the technology realm was feature engineering. John Thompson said that in data science, “feature engineering” (selecting, generating, and tuning the variables used in a machine learning model) is “where the magic happens.” He said that only a few data scientists are really good at it. His company uses a Domino feature store to make engineered features accessible and documented for other data scientists to use. I asked Thompson if he felt like the process of feature engineering was becoming more automated, and he said no; it’s still mostly an art, and a very important one for data science to thrive in an organization.
Then there was some brief discussion again of the human and cultural aspect of sharing and reusing features; instead of creating their own models and features, data scientists need to first look around to see what others have already done. That set of behaviors may require substantial cultural change.
It is fitting that the discussion of transforming data science started and ended with the human dimension. Despite great progress in technology and an increasing focus on process, the roundtable participants agreed that success in data science generally comes down to the data scientists who are doing the work. Perhaps that’s why John Thompson refused to provide the name of his data scientist who is a whiz at feature engineering.