Saturday, October 15, 2016

Wisdom Of The Week

Machine learning is like a deep-fat fryer. If you've never deep-fried something before, you think to yourself: "This is amazing! I bet this would work on anything!"

And it kind of does.

And in any deep frying situation, a good question to ask is: what is this stuff being fried in?

[---]

So what's your data being fried in? These algorithms train on large collections that you know nothing about. Sites like Google operate at a scale hundreds of times bigger than anything in the humanities. Any irregularities in that training data end up infused into the classifier.

For this reason I've referred to machine learning as money laundering for bias. It's easy to put anything you want in training data.
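The laundering effect is easy to reproduce in a toy setting. The sketch below is entirely hypothetical (the corpus, labels, and sentences are invented for illustration): it trains a bare-bones word-count classifier on a tiny corpus whose labels encode an annotator's stereotype, and the model then returns that stereotype as a confident-looking prediction.

```python
from collections import Counter

# Hypothetical toy corpus. The labels encode an annotator's stereotype,
# not ground truth: every sentence mentioning "nurse" was labeled
# "female", every sentence mentioning "engineer" was labeled "male".
training_data = [
    ("nurse checked chart", "female"),
    ("nurse gave medicine", "female"),
    ("engineer fixed server", "male"),
    ("engineer wrote code", "male"),
]

def train(data):
    """Count how often each word co-occurs with each label."""
    counts = {}
    for text, label in data:
        for word in text.split():
            counts.setdefault(word, Counter())[label] += 1
    return counts

def classify(counts, text):
    """Score labels by summed word-label co-occurrence; highest wins."""
    scores = Counter()
    for word in text.split():
        scores.update(counts.get(word, Counter()))
    return scores.most_common(1)[0][0] if scores else None

model = train(training_data)

# A sentence about debugging comes back "female" purely because it
# mentions "nurse" -- the skew in the labels, laundered into a prediction.
print(classify(model, "nurse debugged the kernel"))  # female
```

Nothing in the model is "prejudiced"; it is faithfully reporting correlations in its training data, which is exactly why the bias is so hard to see from the outside.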

For example, if you go to Google Translate and paste in an Arabic-language article about terrorism or the war in Syria, you get something that reads like it was written by a native speaker of English. If you type in a kid's letter from camp, or an extract from a novel, the English text reads like it was written by Frankenstein's monster.

This isn't because Google's algorithm is a gung-ho war machine; it reflects the corpus of data the system was trained on. I'm sure other languages would show their own irregularities.

Prejudice isn’t always a problem. Some uses of machine learning are inherently benign. In an earlier talk, we heard about identifying poetry in newspapers based on formatting, an excellent use of image recognition. OCR is another area where there are no concerns.

Others, though, are problematic. I'd be very wary of using "sentiment analysis" or anything to do with social networks without careful experimental design.

I find it helpful to think of algorithms as a dim-witted but extremely industrious graduate student, whom you don't fully trust. You want a concordance made? An index? You want them to go through ten million photos and find every picture of a horse? Perfect.

You want them to draw conclusions on gender based on word use patterns? Or infer social relationships from census data? Now you need some adult supervision in the room.

Besides these issues of bias, there's also an opportunity cost in committing to computational tools. What irks me about the love affair with algorithms is that they remove a lot of the potential for surprise and serendipity that you get by working with people.

- Machine Learning Is Like a Deep-fat Fryer
