Teradata is paying me to join a panel on “data science” in downtown Boston, Tuesday May 22, at 3:00 pm. A planning phone call led me to jot down a few notes on the subject, which I’m herewith adapting into a blog post.
For starters, I have some concerns about the concepts of data science and data scientist. Too often, the term “data scientist” is used to suggest that one person needs to have strong skills both in analytics and in data management. But in reality, splitting those roles makes perfect sense. Further:
- It may or may not make sense to say that a computer scientist is doing “science”; the term “data scientist” inherits that ambiguity.
- It may or may not make sense to say that a corporate scientist is doing “science”; for example, a petroleum geologist might do very valuable work without making any scientific discoveries. The term “data scientist” inherits that ambiguity too.
- Too often, people use the term big data as if it were something radically new, rather than a continuation of what has been done in large-scale analytic data management for decades. “Data science” has a similar problem.
- The term “data science” sounds as if you need specialized academic training to do it, which isn’t really true.
The leader in raising these issues is probably Neil Raden.
But there’s one respect in which I think the term “data science” is highly appropriate. In conventional science, gathering data is just as much of an accomplishment as analyzing it. Indeed, most Nobel Prizes are given for experimental results. Similarly, if you’re doing data science, you should be thinking hard about how to corral ever more useful data. Techniques include but are not limited to:
- Keeping data you used to throw away. This has driven a lot of growth in relational data warehouses and big bit buckets alike.
- Bribing customers and prospects. Loyalty cards are the paradigmatic example.
- Split testing. The more internet-based users you have, the more tests you can do.
- Storing derived data. That can be as simple as pre-computing the scores from your predictive analytics model, or it can be as complex as running a 50-step sequence of Hadoop jobs.
- Getting data from third parties, for example:
- Supply chain partners (right now this rarely amounts to more than simple BI, but that could change in the future).
- Data vendors of various kinds (e.g. credit bureaus).
- Social media/the internet in general, which also usually involves some kind of service provider.