How I trained a language detection AI in 20 minutes with 97% accuracy

This article was first published on Towards Data Science - Medium.


This story is a step-by-step guide to how I built a language detection model using machine learning (that ended up being 97% accurate) in under 20 minutes.

Language detection is a great use case for machine learning, and more specifically for text classification. Given some text from an e-mail, a news article, the output of a speech-to-text system, or anywhere else, a language detection model will tell you what language it is in.

This is a great way to quickly categorize and sort information, and to apply additional layers of language-specific workflows. For example, if you want to apply spell checking to a Word document, you first have to pick the correct language for the dictionary being used. Otherwise the spell checker will flag perfectly good words as mistakes.

Other use cases might include routing e-mails to the right geographically located customer service department, applying the correct subtitles or closed-captioning to a video, or applying some other language-specific text classification to the text you’re analyzing.

Ok, you get it: language detection is really useful. Let’s move on to how I did it so quickly.

I started with this dataset. https://cloud.google.com/prediction/docs/language_id.txt

It’s basically a .csv with samples of English, French and Spanish. My goal was to see if I could train a machine learning model to tell those languages apart and then, given some new text, predict the language it was in.

So the first thing I did was spin up Classificationbox, a machine learning model builder that runs in a Docker container and has a simple API. This took less than a minute.

The output of the terminal

Then I cloned and downloaded this handy tool that makes it really easy to train Classificationbox with text files on your computer. This took another minute or so.

The next step was to convert the CSV into text files so that I could easily train Classificationbox.

A proper developer would skip this step and just parse the CSV file and make API calls to Classificationbox directly from there.

Here is some not-great Go code I wrote, in case you’re interested. If not, please skip to the next step.

After running this script, I had folders on my hard drive named for the different languages and inside each folder were text files with the language samples. It took me about 10 minutes to write the script and run it.

Now comes the fun part. I made sure Classificationbox was up and running, then I ran imgclass on the parent directory of the language folders. It took about 3 seconds to:

  1. Process all the samples
  2. Split 20% of the samples into a validation set
  3. Train Classificationbox with the training set
  4. Validate with the validation set

These were my results:

97%! That’s pretty good for spending only 20 minutes on training a language detection machine learning model.

One important thing to note is that my classes were not balanced. I had a different number of samples for each class, which does not follow best practices for training a model. Ideally, I would have the exact same number of examples in each class.
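One simple way to get balanced classes is to downsample every class to the size of the smallest one. The sketch below shows that idea under the assumption that the samples are already grouped by class in memory; the `balance` function is hypothetical and was not part of this experiment.

```go
package main

import "fmt"

// balance trims every class to the size of the smallest class, so that
// each class contributes the same number of samples.
func balance(byClass map[string][]string) map[string][]string {
	smallest := -1
	for _, texts := range byClass {
		if smallest == -1 || len(texts) < smallest {
			smallest = len(texts)
		}
	}
	out := make(map[string][]string, len(byClass))
	for class, texts := range byClass {
		// Keeps the first `smallest` samples; shuffle beforehand if
		// the order of the data is not already random.
		out[class] = texts[:smallest]
	}
	return out
}

func main() {
	byClass := map[string][]string{
		"English": {"Hello", "Good morning", "Thank you"},
		"French":  {"Bonjour", "Merci"},
		"Spanish": {"Hola", "Gracias", "Buenos dias", "Adios"},
	}
	for class, texts := range balance(byClass) {
		fmt.Println(class, len(texts))
	}
}
```

Downsampling throws data away, so with a small dataset it trades some training signal for balance.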

The point is, machine learning benefits most from experimentation. I strongly encourage everyone to give it a try using Machine Box or any other tool. I hope I was able to demonstrate just how easy it is to create your own machine learning / classification model given a good data set.

What is Machine Box?

Machine Box puts state-of-the-art machine learning capabilities into Docker containers so developers like you can easily incorporate natural language processing, facial detection, object recognition, etc. into your own apps very quickly.

The boxes are built for scale, so when your app really takes off, just add more boxes horizontally, to infinity and beyond. Oh, and it’s way cheaper than any of the cloud services (and the boxes might be better)… and your data doesn’t leave your infrastructure.

 

By: Aaron Edell