Last week I was listening to Pieter Abbeel’s excellent new podcast ‘Robot Brains’, and specifically an interview with Andrej Karpathy, the head of AI at Tesla. What fascinated me was how many similarities (apart from resources) there were between how we set up our machine learning ‘infrastructure’ for text and how Tesla have set up their approach.
One key part is a shared belief that models aren’t created by data scientists but by domain experts curating data. Karpathy calls this ‘software 2.0’. In this approach the data scientists build the infrastructure that lets data labellers train the models.
All of our ML models depend on vast amounts of training data; however, more isn’t always better. I haven’t counted, but I’d be surprised if we had fewer than 1 million rows in our carefully curated training dataset for the generic models. As I discuss in my People Analytics World presentations, it’s all in the edge cases, and we have approaches to find and label these edge cases. Tesla identify an issue (for example, signs hidden in foliage) and then ask their fleet of almost 1 million cars to send back real-world examples for the labelling teams. Both Tesla and we base our machine learning approaches on a family of techniques called ‘Active Learning’.
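To make the idea concrete, here is a minimal sketch of one active-learning step using uncertainty sampling. Everything in it is illustrative: the toy 2-D points, the nearest-centroid ‘model’, and the margin-based uncertainty score are stand-ins I’ve chosen for brevity, not anyone’s actual pipeline. The core loop, though, is the real pattern: the model ranks the unlabelled pool by how unsure it is, and the human labellers are asked about the boundary cases first.

```python
# Minimal active-learning step (uncertainty sampling) on toy 2-D data.
# The "model" is nearest-centroid; uncertainty is the gap between the
# distances to the two nearest class centroids (small gap = uncertain).
import math

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def most_uncertain(labeled, unlabeled):
    """Return the unlabeled point the current model is least sure about."""
    cents = {lab: centroid([p for p, l in labeled if l == lab])
             for lab in {l for _, l in labeled}}
    def margin(p):
        d = sorted(distance(p, c) for c in cents.values())
        return d[1] - d[0]  # near the decision boundary => small margin
    return min(unlabeled, key=margin)

# Toy pool: one seed example per class, plus points near the boundary
# (the "edge cases") and easy points deep inside each cluster.
labeled = [((0.0, 0.0), "a"), ((4.0, 4.0), "b")]
unlabeled = [(0.2, 0.1), (3.9, 4.2), (2.0, 2.1), (1.9, 1.8)]

# One active-learning step: the labeller is asked about a boundary
# point, not an easy cluster member.
query = most_uncertain(labeled, unlabeled)  # -> (2.0, 2.1)
```

In a real loop the queried point is labelled by a human, moved into the training set, the model is retrained, and the cycle repeats, which is how labelling effort gets concentrated on the edge cases rather than on examples the model already handles.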
In my workshop on text analysis at the forthcoming People Analytics World (see below for a discount code) I’m going to share not only how to do text analysis but also the infrastructure, including organisation structure, that I think you need to do this well. It’s not hard to find a guide to how to apply the latest algorithm to text. What I hope to provide is insight into how to build a scalable capability within your teams.
Have a great week.