Language classification using Machine Learning in PHP

By Daniel Liljeberg

Like many of us, I decided to dabble a bit in Machine Learning (ML) and took a short course on the subject. One part of the course was to create a small ML project. I wanted to make something practical out of it, though, so I looked around at work to see what issues I could try to solve using ML.

Like any large company, we do business across basically the entire globe. Business generates support requests, and my first idea was to estimate the time a case would take to solve based on different parameters of the case. This could then be used to allocate staffing, identify specific types of cases that would benefit from better solutions, etc. With limited time and without access to enough data about the cases, though, I was forced to abandon this idea: the only things I could easily get at the time were basically the subject line, the email address of the sender and a few other pieces. That was not enough to build a model robust enough for me to trust the estimates it would produce.

So I looked at it again: what other “problems” do our support tickets present? Well, since many support requests are sent to the same initial email address, that address receives requests in all manner of languages. Sure, we can handle that manually and re-assign each request to support personnel in the country in question, or ask everyone who submits a request to re-submit it in English. But the first option takes time and adds the risk of requests being left waiting before they are re-assigned, and the second had already been tested without much success, partly because people were used to receiving support in their native language. Just because our company grew, it made little sense to expect that this would go away.

So, what if we could identify the language and then use that to redirect the request to a group of people who know the language in question and can provide support in that language?

Many ML examples are written in Python, and so was a major part of the course. But I decided to see if I could write my project in PHP. I do most of my development in C++, but when it comes to web development and scripting I have used PHP for a long time. So I wanted to see how easy it would be to use PHP, which would also let me easily use what I created in my own PHP projects. A quick look around and I found PHP-ML, a Machine Learning library for PHP. It lacks some of the stuff you find in Scikit-learn, for instance, but it seems to be in active development and adding features, so things like Convolutional Neural Networks that are not in the library today will probably make it in with time.

Choosing classifier

Since I would be working with text and the goal was to classify which language a given text was written in, a classifier like Naive Bayes or SVC (Support Vector Classification) felt like the best option.

To help newcomers going down the ML rabbit hole, Scikit-learn.org has a good map for choosing the correct classifier, and it affirmed my initial pick.

I decided to test both Naive Bayes and SVC since I felt Naive Bayes would be faster to train, but SVC might result in better accuracy.
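To show why Naive Bayes trains so quickly, here is a minimal sketch of a multinomial Naive Bayes classifier over word counts in plain PHP. It is an illustration only and not how PHP-ML implements its classifier; the function names tokenize, trainNaiveBayes and predictLanguage are made up for this example, and the add-one smoothing is an assumption of the sketch.

```php
<?php
// Illustration only: a tiny word-count Naive Bayes, not PHP-ML's implementation.

function tokenize(string $text): array
{
    // ASCII-only lowercasing, then split on anything that is not a Unicode letter.
    $tokens = preg_split('/[^\p{L}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? [] : $tokens;
}

function trainNaiveBayes(array $samples, array $labels): array
{
    // "Training" is nothing more than counting words and sentences per label.
    $model = ['wordCounts' => [], 'totalWords' => [], 'docCounts' => [], 'vocabulary' => []];
    foreach ($samples as $i => $text) {
        $label = $labels[$i];
        $model['docCounts'][$label] = ($model['docCounts'][$label] ?? 0) + 1;
        foreach (tokenize($text) as $word) {
            $model['wordCounts'][$label][$word] = ($model['wordCounts'][$label][$word] ?? 0) + 1;
            $model['totalWords'][$label] = ($model['totalWords'][$label] ?? 0) + 1;
            $model['vocabulary'][$word] = true;
        }
    }
    return $model;
}

function predictLanguage(array $model, string $text): string
{
    $words = tokenize($text);
    $vocabularySize = count($model['vocabulary']);
    $totalDocs = array_sum($model['docCounts']);
    $bestLabel = '';
    $bestScore = -INF;
    foreach ($model['docCounts'] as $label => $docCount) {
        // Log prior plus the sum of log likelihoods with add-one smoothing.
        $score = log($docCount / $totalDocs);
        foreach ($words as $word) {
            $count = $model['wordCounts'][$label][$word] ?? 0;
            $score += log(($count + 1) / (($model['totalWords'][$label] ?? 0) + $vocabularySize));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestLabel = (string) $label;
        }
    }
    return $bestLabel;
}

$model = trainNaiveBayes(
    ['vad heter du', 'jag tycker om kaffe', 'what is your name', 'i like coffee in the morning'],
    ['sv', 'sv', 'en', 'en']
);
echo predictLanguage($model, 'Jag heter Anna'), PHP_EOL; // prints "sv"
```

Training is a single pass of counting, which fits the intuition that Naive Bayes should be cheap to train, while an SVC has to solve an optimisation problem over the whole training set.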

Data collection and pre-processing

The first obstacle in creating any ML model is getting the data to train on. If I wanted to classify a set of languages, I clearly needed sample sentences from those languages to train on. But collecting a large set of sentences in each language would be quite time consuming and time was something I didn’t have, not to mention that I don’t speak the majority of these languages. I decided to go another, more rudimentary route. Instead of manually translating a list of sentences or finding unique lists of sentences, I decided to let Google do the job for me.

I started by collecting a set of sentences in English to use as my base sample set. I acquired these basically by Googling “English sentences for kids”, “English sentences for adults” etc. and compiling a list of sentences that I could use as my base. The first issue I found was that many of the sentences included uniquely English sayings, which could cause issues in translation since they are not meant to be taken literally. But I cleaned up the most extreme ones and ended up with 1635 English sentences.

If this was going into the actual production environment, I would vet the base data-set much more than this since that is the base for everything to come. But for my initial proof-of-concept it would have to do.

After this, I looked up which languages Google supported.

$languages = [
    'af' => 'Afrikaans',
    'sq' => 'Albanian',
    'ar' => 'Arabic',
    'az' => 'Azerbaijani',
    'eu' => 'Basque',
    'bn' => 'Bengali',
    'be' => 'Belarusian',
    'bg' => 'Bulgarian',
    'ca' => 'Catalan',
    'zh-CN' => 'Chinese Simplified',
    'zh-TW' => 'Chinese Traditional',
    'hr' => 'Croatian',
    'cs' => 'Czech',
    'da' => 'Danish',
    'nl' => 'Dutch',
    'en' => 'English',
    'eo' => 'Esperanto',
    'et' => 'Estonian',
    'tl' => 'Filipino',
    'fi' => 'Finnish',
    'fr' => 'French',
    'gl' => 'Galician',
    'ka' => 'Georgian',
    'de' => 'German',
    'el' => 'Greek',
    'gu' => 'Gujarati',
    'ht' => 'Haitian Creole',
    'iw' => 'Hebrew',
    'hi' => 'Hindi',
    'hu' => 'Hungarian',
    'is' => 'Icelandic',
    'id' => 'Indonesian',
    'ga' => 'Irish',
    'it' => 'Italian',
    'ja' => 'Japanese',
    'kn' => 'Kannada',
    'ko' => 'Korean',
    'la' => 'Latin',
    'lv' => 'Latvian',
    'lt' => 'Lithuanian',
    'mk' => 'Macedonian',
    'ms' => 'Malay',
    'mt' => 'Maltese',
    'no' => 'Norwegian',
    'fa' => 'Persian',
    'pl' => 'Polish',
    'pt' => 'Portuguese',
    'ro' => 'Romanian',
    'ru' => 'Russian',
    'sr' => 'Serbian',
    'sk' => 'Slovak',
    'sl' => 'Slovenian',
    'es' => 'Spanish',
    'sw' => 'Swahili',
    'sv' => 'Swedish',
    'ta' => 'Tamil',
    'te' => 'Telugu',
    'th' => 'Thai',
    'tr' => 'Turkish',
    'uk' => 'Ukrainian',
    'ur' => 'Urdu',
    'vi' => 'Vietnamese',
    'cy' => 'Welsh',
    'yi' => 'Yiddish'
];

My idea was to make HTTP requests to translate.google.com, translating the English sentences into each of the other languages. A quick search showed that a small library doing exactly what I had in mind already existed, so instead of writing it from scratch I decided to use it: https://github.com/Stichoza/google-translate-php

With it you can simply call something like

$gt = new GoogleTranslate('en', 'sv');
$translatedString = $gt->translate("Hello World");

Making it all come together

If this was going into a system or library, I would have made the LanguageClassifier a class that encapsulated all the functionality and was easy to use through its public interface. But for my demonstration I decided to go with a simple standalone script intended to be run from a terminal, printing information about what it is doing during execution.

I had one file, sentences.txt, that held the English sentences to use as a base, and another, languagedataset.ser, which contained a serialized array that looked something like this

[
	'en' => ['xxxxxx' => 'Hello World', 'yyyyyy' => 'I like coffee in the morning'],
	'sv' => ['xxxxxx' => 'Goddag världen', 'yyyyyy' => 'Jag tycker om kaffe på morgonen'],
	...
]

where xxxxxx, yyyyyy etc. were checksums of the original English sentence, making it possible to map a sentence to its translation in each of the languages. The model trained on the data set would then be stored in a file called model.dat.
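As a minimal sketch of this layout, the snippet below keys each sentence by a checksum of the English original and round-trips the array through a serialized file. The use of md5() is an assumption (the text above does not say which checksum was used), and sentenceKey and buildBaseSentences are hypothetical helper names.

```php
<?php
// Assumption: md5() stands in for the checksum; the article does not name the function used.
function sentenceKey(string $englishSentence): string
{
    return md5(trim($englishSentence));
}

// Hypothetical helper: turn the lines of sentences.txt into a checksum => sentence map.
function buildBaseSentences(array $lines): array
{
    $base = [];
    foreach ($lines as $line) {
        $sentence = trim($line);
        if ($sentence !== '') {
            $base[sentenceKey($sentence)] = $sentence;
        }
    }
    return $base;
}

$dataset = ['en' => buildBaseSentences(["Hello World\n", "I like coffee in the morning\n", "\n"])];

// A translation is stored under the checksum of its English original.
$dataset['sv'][sentenceKey('Hello World')] = 'Goddag världen';

// Round-trip through a serialized file, the same kind of file as languagedataset.ser.
$path = tempnam(sys_get_temp_dir(), 'lang');
file_put_contents($path, serialize($dataset));
$restored = unserialize(file_get_contents($path), ['allowed_classes' => false]);
unlink($path);

// The same key finds the sentence in every language that has a translation.
$key = sentenceKey('Hello World');
echo $restored['en'][$key], ' / ', $restored['sv'][$key], PHP_EOL;
```

Because the key depends only on the English sentence, checking which translations are missing for a language becomes a comparison of array keys.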

The workflow of the script was something like this.

/*
 * - If sentences.txt holding base sentences in English exists
 *      - If languagedataset.ser does not exist
 *          - Set up initial English sentences from sentences.txt
 * - If sentences.txt contains new sentences or has had sentences removed
 *      - Update English sentences in dataset
 * - For each language
 *      - Check if any English sentence is missing for the language
 *          - Translate each missing sentence using Google Translate
 *      - Store updated languagedataset.ser
 *
 * - If model.dat does not exist or we have to retrain due to changed dataset
 *      - Transform format from languagedataset.ser to an ArrayDataset, train
 *        and check accuracy
 *      - Save model.dat
 * - Else load model.dat
 * - Classify language of sentences passed
 */
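The dataset part of this workflow can be sketched roughly as follows. This is a hypothetical helper and not the code from the actual script: syncDataset is an invented name, the translator is passed in as a callable so that a stub can replace Google Translate, and dropping translations whose English sentence was removed is an assumption of this sketch.

```php
<?php
// Hypothetical helper: bring every language in line with the English base sentences.
// $translate is any callable taking (string $sentence, string $languageCode) and returning a string.
// Returns [updated dataset, whether anything changed (and a retrain is needed)].
function syncDataset(array $dataset, array $baseSentences, array $languageCodes, callable $translate): array
{
    $changed = false;

    // Make the English entries match the current base list (adds new ones, drops removed ones).
    $oldEnglish = $dataset['en'] ?? [];
    if (array_diff_key($baseSentences, $oldEnglish) || array_diff_key($oldEnglish, $baseSentences)) {
        $dataset['en'] = $baseSentences;
        $changed = true;
    }

    foreach ($languageCodes as $code) {
        if ($code === 'en') {
            continue;
        }
        $existing = $dataset[$code] ?? [];
        // Assumption: translations of removed English sentences are dropped as well.
        $kept = array_intersect_key($existing, $baseSentences);
        // Only sentences without a translation are sent to the translator.
        $missing = array_diff_key($baseSentences, $kept);
        foreach ($missing as $key => $sentence) {
            $kept[$key] = $translate($sentence, $code);
        }
        if ($missing || count($kept) !== count($existing)) {
            $changed = true;
        }
        $dataset[$code] = $kept;
    }

    return [$dataset, $changed];
}

// Usage with a stub translator; a real run would call a translation library here.
$calls = 0;
$stub = function (string $sentence, string $code) use (&$calls): string {
    $calls++;
    return "[$code] $sentence";
};
$base = ['k1' => 'Hello World', 'k2' => 'I like coffee'];
$old = ['en' => ['k1' => 'Hello World'], 'sv' => ['k1' => 'Goddag världen']];
[$updated, $changed] = syncDataset($old, $base, ['en', 'sv', 'da'], $stub);
echo $changed ? 'Dataset updated' : 'Dataset up to date', PHP_EOL;
```

Returning the changed flag lets the caller decide whether to retrain and save a new model or simply load the existing one.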

The result is a script that can take a list of strings and spit out predictions regarding which language each string is written in.

C:\tools\php72\php.exe E:\Dropbox\Projects\php-ml\languageClassification.php "vad heter du?" "El toro" "Wie viel kostet dieser Computer?" "Jutro będzie piękna pogoda" "Domani sarà bel tempo" "Morgen zal het prachtig weer zijn" "Huomenna tulee kaunis sää" "Mon ordinateur ne démarre pas" "min computer starter ikke, og det gør mig skør" "Goedemorgen, Graag ontvang ik de licentiefiles voor de meters zoals in de bijlage genoemd. User name: [email protected] Company: Foo Bar klimaattechniek Customer number: 0123456"
Dataset up to date
Loading model... Done
array(10) {
  ["vad heter du?"]=>
  string(2) "sv"
  ["El toro"]=>
  string(2) "es"
  ["Wie viel kostet dieser Computer?"]=>
  string(2) "de"
  ["Jutro będzie piękna pogoda"]=>
  string(2) "pl"
  ["Domani sarà bel tempo"]=>
  string(2) "it"
  ["Morgen zal het prachtig weer zijn"]=>
  string(2) "nl"
  ["Huomenna tulee kaunis sää"]=>
  string(2) "fi"
  ["Mon ordinateur ne démarre pas"]=>
  string(2) "fr"
  ["min computer starter ikke, og det gør mig skør"]=>
  string(2) "da"
  ["Goedemorgen, Graag ontvang ik de licentiefiles voor de meters zoals in de bijlage genoemd. User name: [email protected] Company: Foo Bar klimaattechniek Customer number: 0123456"]=>
  string(2) "nl"
}

Test results

Due to time constraints, instead of using all the languages that my program supported I decided to go with a subset. This allowed me to run several tests of different sizes of the training set to see how that affected performance.

I decided to go with eleven different languages and wanted to have a few that were somewhat “similar” to make it a bit harder for the classifier, so I included Swedish, Norwegian and Danish.

The complete list of tested languages was:

Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Spanish and Swedish.

The two learning algorithms were run using sample sizes ranging from 100 to 600 sentences. Each sentence was translated into all of the languages, so a sample set of 100 includes 100 sentences for each language. With eleven languages, the actual sample sizes therefore ranged from 1100 to 6600 sentences.

Of these, a random selection of 90% was used as the training set and the remaining 10% as the test set. Naive Bayes, as expected, was much faster to train. Elapsed time of course depends on the hardware used during training, but it’s fair to say that Naive Bayes using 600 sentences per language was almost as fast as SVC at 100 sentences per language. Where Naive Bayes was always a matter of counting seconds while training, SVC quickly moved into counting minutes.
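A random 90/10 split and an accuracy check can be sketched in plain PHP as shown below. PHP-ML has its own classes for this (Phpml\CrossValidation\RandomSplit and Phpml\Metric\Accuracy); the sketch avoids them so it runs without the library, and randomSplit and accuracy are names chosen for this example only.

```php
<?php
// Split samples and labels into train and test parts; $testSize is a fraction such as 0.1.
function randomSplit(array $samples, array $labels, float $testSize, int $seed): array
{
    $indices = array_keys($samples);
    mt_srand($seed);   // fixed seed so a run can be repeated
    shuffle($indices);

    $testCount = (int) round(count($indices) * $testSize);
    $split = ['trainSamples' => [], 'trainLabels' => [], 'testSamples' => [], 'testLabels' => []];
    foreach ($indices as $position => $index) {
        // The first $testCount shuffled entries form the test set, the rest the training set.
        $part = $position < $testCount ? 'test' : 'train';
        $split[$part . 'Samples'][] = $samples[$index];
        $split[$part . 'Labels'][] = $labels[$index];
    }
    return $split;
}

// Fraction of predictions that match the actual labels.
function accuracy(array $actual, array $predicted): float
{
    if (count($actual) === 0) {
        return 0.0;
    }
    $correct = 0;
    foreach ($actual as $i => $label) {
        if ($label === $predicted[$i]) {
            $correct++;
        }
    }
    return $correct / count($actual);
}

// 100 sentences in each of 11 languages gives 1100 samples: 990 to train on, 110 to test on.
$samples = [];
$labels = [];
for ($i = 0; $i < 1100; $i++) {
    $samples[] = "sentence $i";
    $labels[] = "language " . ($i % 11);
}
$split = randomSplit($samples, $labels, 0.1, 42);
echo count($split['trainSamples']), ' train / ', count($split['testSamples']), ' test', PHP_EOL;
```

Changing the seed gives a different selection of sentences, which is one reason accuracy can vary between runs that otherwise use identical settings.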

In the table we can see the accuracy results for different amounts of sentences per language used to train on and different variables used for the SVC classifier. Variants of Naive Bayes aren’t currently supported by PHP-ML so only the basic version was tested.

Since the sentences selected are random each time, and the quality of each sentence and its corresponding translations will be better for some than others, variance in reported accuracy is to be expected even for runs with identical settings.

From the table we can see that Naive Bayes gives a result similar to SVC with much faster training, but SVC edges ahead if one tweaks the parameters and takes the win with a maximum score of ~94% accuracy.

The more sentences used, the longer it takes to train and the larger the resulting model ends up being. For production purposes I would aim for a reasonable middle ground, since a large model takes longer to load, can result in more time being needed for predictions, etc.

Source code

A script implementing the above ideas can be found here: https://github.com/inquam/php-ml-language-classification