Classifying first names by gender (male/female)

Xiaoou WANG

This very simple example, taken from Steven Bird’s book, gives you an idea of what the job of a junior Machine Learning engineer looks like.

The task is to train a naive Bayes classifier to predict the gender of a first name.

Feature selection

We start by taking the last letter of a first name as a feature and storing it in a dictionary.

[3]:
#! create the last letter as a feature

import nltk
from nltk.corpus import names
import random
random.seed(13)
def gender_features(word):
     return {'last_letter': word[-1]}

print("The last letter of the name Shrek is")
gender_features('Shrek')
The last letter of the name Shrek is
[3]:
{'last_letter': 'k'}

Preparing the corpus

The corpus comes from nltk. Here we build a list of tuples using a few methods built into nltk.corpus. Note the use of list comprehensions to keep the code concise while staying readable.

[6]:
#! build the dataset

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
 [(name, 'female') for name in names.words('female.txt')])

random.shuffle(labeled_names)
print("A sample of the corpus")
labeled_names[:10]
A sample of the corpus
[6]:
[('Mariam', 'female'),
 ('Marjorie', 'female'),
 ('Jasmin', 'female'),
 ('Welbie', 'male'),
 ('Modesty', 'female'),
 ('Kanya', 'female'),
 ('Michale', 'male'),
 ('Antonina', 'female'),
 ('Beulah', 'female'),
 ('Hazel', 'female')]

Creating the train/test sets

Here we apply the gender_features function from the first section to every name in the corpus. The first 500 samples are set aside to serve as the test set.

[10]:
#! build (features, label) pairs

featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
[11]:
#! build the train and test sets

train_set, test_set = featuresets[500:], featuresets[:500]

First classification

The accuracy on the test set is 75.2%. We list the most informative features to look for potential problems. The male : female likelihood ratio is not a probability that a particular name is male or female. It tells you how many times more frequent a given last letter (the feature) is among the names of one gender than among those of the other, as estimated from the training set.

[12]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
#! classify
classifier.classify(gender_features('Neo'))
classifier.classify(gender_features('Trinity'))
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
#! see likelihood ratios
0.752
Most Informative Features
             last_letter = 'k'              male : female =     45.8 : 1.0
             last_letter = 'a'            female : male   =     33.0 : 1.0
             last_letter = 'f'              male : female =     15.3 : 1.0
             last_letter = 'p'              male : female =     11.2 : 1.0
             last_letter = 'v'              male : female =     10.5 : 1.0
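
To make the likelihood ratio concrete, here is a minimal sketch that computes it by hand on a tiny made-up list of names. It uses raw relative frequencies, whereas NLTK's naive Bayes classifier smooths its estimates, so the numbers are illustrative only. The function name `likelihood_ratio` and the toy data are assumptions and are not part of the original notebook.

```python
from collections import Counter

# Toy data, invented for illustration (not the NLTK names corpus).
toy_names = [
    ("Mark", "male"), ("Frank", "male"), ("Erik", "male"), ("John", "male"),
    ("Anna", "female"), ("Maria", "female"), ("Brook", "female"), ("Helen", "female"),
]

def likelihood_ratio(labeled_names, letter, label_a, label_b):
    """Return P(last_letter=letter | label_a) / P(last_letter=letter | label_b),
    estimated with unsmoothed relative frequencies."""
    totals = Counter(label for _, label in labeled_names)
    hits = Counter(label for name, label in labeled_names
                   if name[-1].lower() == letter)
    p_a = hits[label_a] / totals[label_a]
    p_b = hits[label_b] / totals[label_b]
    # Without smoothing this raises ZeroDivisionError when the letter
    # never occurs with label_b.
    return p_a / p_b

ratio = likelihood_ratio(toy_names, "k", "male", "female")
print("last_letter = 'k'  male : female = {:.1f} : 1.0".format(ratio))
```

In this toy list, 3 of the 4 male names and 1 of the 4 female names end in k, so the ratio is 3.0 : 1.0.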

Adding features and the overfitting problem

If you add too many features, the model risks fitting your training data too closely and generalizing poorly to unseen data. This is called overfitting, and it often happens when the corpus is small, which is the case here.
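
The sketch below illustrates overfitting on a deliberately extreme case: a "classifier" that simply memorizes every training name. It scores perfectly on the names it has seen and falls back to a default guess on everything else. The toy data and the helper names (`train_memorizer`, `accuracy`) are hypothetical and are not part of the original notebook.

```python
# Toy data, invented for illustration.
train = [("Anna", "female"), ("Maria", "female"), ("John", "male"), ("Erik", "male")]
unseen = [("Laura", "female"), ("Helen", "female"), ("Frank", "male"), ("Sofia", "female")]

def train_memorizer(labeled_names, default="male"):
    """Build a classifier that memorizes whole names (one feature per name)."""
    table = dict(labeled_names)
    return lambda name: table.get(name, default)

def accuracy(classify, labeled_names):
    """Fraction of names whose predicted label matches the gold label."""
    correct = sum(classify(name) == label for name, label in labeled_names)
    return correct / len(labeled_names)

memorizer = train_memorizer(train)
train_acc = accuracy(memorizer, train)
unseen_acc = accuracy(memorizer, unseen)
print(train_acc, unseen_acc)
```

A large gap between the two scores is the usual symptom of overfitting. With richer feature sets the effect is subtler, but the diagnosis is the same.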

Let’s add features

As a first step, we will try adding lots of features.

Let’s look at the function gender_features2. The features are:

  • The first and the last letter

  • For each letter from a to z, a boolean indicating whether that letter is present in the name

  • For each letter from a to z, an integer giving the number of occurrences of that letter

So clearly we are throwing everything at it…

[13]:
#! add features
def gender_features2(name):
    features = {"first_letter": name[0].lower(), "last_letter": name[-1].lower()}
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

demo = gender_features2("john")
random.sample(list(demo.items()), 5)  # random.sample needs a sequence, not a dict view
[13]:
[('count(i)', 0),
 ('count(h)', 1),
 ('count(x)', 0),
 ('has(a)', False),
 ('has(h)', True)]

Accuracy goes up

With this new feature set, accuracy rose from 75.2% to 77.4%.

[14]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

#! 0.752 vs 0.774
0.774

Feature engineering through error analysis

Here we walk through the feature engineering process, which consists of analyzing the model’s errors and, based on them, filtering, removing or creating features.

Artificial intelligence is not so artificial after all, is it? (OK, I admit this is not deep learning, but still.)

Note the creation of a ‘devset’ here. This three-way split is standard in Machine Learning. The train set is used to train the model and the devset to tune it. Finally, the test set must only be used for the final evaluation.

The initial accuracy on the devset (before feature engineering) is therefore 76%.

[15]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]
[16]:
#! use devtest
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, devtest_set))
0.76
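
The slicing used for this split can be wrapped in a small helper. This is a sketch only: the function name `three_way_split` and its parameters are assumptions, and it presumes the list has already been shuffled, as `labeled_names` was earlier.

```python
def three_way_split(items, test_size, dev_size):
    """Split an already-shuffled list into (train, dev, test).

    The first `test_size` items form the test set, the next `dev_size`
    items the dev set, and the remainder the training set.
    """
    if test_size + dev_size >= len(items):
        raise ValueError("test_size + dev_size must leave items for training")
    test = items[:test_size]
    dev = items[test_size:test_size + dev_size]
    train = items[test_size + dev_size:]
    return train, dev, test

# Example with placeholder data standing in for labeled_names.
items = list(range(20))
train, dev, test = three_way_split(items, test_size=5, dev_size=5)
print(len(train), len(dev), len(test))
```

With the notebook's data, `three_way_split(labeled_names, 500, 1000)` would give the same three slices as the cell above.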

First hypotheses about the errors

Let’s analyze the errors listed below.

Names ending in yn tend to be female, whereas those ending in n tend to be male. So two rules would be better than one.

The same seems to hold for names ending in h, which are mostly female, and those ending in ch, which tend to be male.

[17]:
errors = []
for (name, tag) in devtest_names:
     guess = classifier.classify(gender_features(name))
     if guess != tag:
         errors.append( (tag, guess, name) )

for (tag, guess, name) in sorted(errors):
     print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))
correct=female   guess=male     name=Aeriel
correct=female   guess=male     name=Aeriell
correct=female   guess=male     name=Allis
correct=female   guess=male     name=Allsun
correct=female   guess=male     name=Allyn
correct=female   guess=male     name=Allys
correct=female   guess=male     name=Amargo
correct=female   guess=male     name=Amber
correct=female   guess=male     name=Anne-Mar
correct=female   guess=male     name=Aurel
correct=female   guess=male     name=Avril
correct=female   guess=male     name=Barb
correct=female   guess=male     name=Beatriz
correct=female   guess=male     name=Beilul
correct=female   guess=male     name=Calypso
correct=female   guess=male     name=Cameo
correct=female   guess=male     name=Carlin
correct=female   guess=male     name=Carol
correct=female   guess=male     name=Carol-Jean
correct=female   guess=male     name=Caron
correct=female   guess=male     name=Caryl
correct=female   guess=male     name=Cat
correct=female   guess=male     name=Ceil
correct=female   guess=male     name=Charin
correct=female   guess=male     name=Charleen
correct=female   guess=male     name=Charlott
correct=female   guess=male     name=Charmian
correct=female   guess=male     name=Charo
correct=female   guess=male     name=Christal
correct=female   guess=male     name=Christel
correct=female   guess=male     name=Cleo
correct=female   guess=male     name=Corliss
correct=female   guess=male     name=Cris
correct=female   guess=male     name=Cristabel
correct=female   guess=male     name=Cybill
correct=female   guess=male     name=Dael
correct=female   guess=male     name=Daloris
correct=female   guess=male     name=Darb
correct=female   guess=male     name=Del
correct=female   guess=male     name=Delores
correct=female   guess=male     name=Dian
correct=female   guess=male     name=Doloritas
correct=female   guess=male     name=Dorcas
correct=female   guess=male     name=Doreen
correct=female   guess=male     name=Dorian
correct=female   guess=male     name=Estell
correct=female   guess=male     name=Esther
correct=female   guess=male     name=Felicdad
correct=female   guess=male     name=Gael
correct=female   guess=male     name=Gilligan
correct=female   guess=male     name=Gladys
correct=female   guess=male     name=Glen
correct=female   guess=male     name=Glynis
correct=female   guess=male     name=Greer
correct=female   guess=male     name=Grissel
correct=female   guess=male     name=Heather
correct=female   guess=male     name=Helen
correct=female   guess=male     name=Hildegaard
correct=female   guess=male     name=Ingeborg
correct=female   guess=male     name=Iseabal
correct=female   guess=male     name=Jaclin
correct=female   guess=male     name=Janel
correct=female   guess=male     name=Jen
correct=female   guess=male     name=Jenifer
correct=female   guess=male     name=Jo-Ann
correct=female   guess=male     name=Jolyn
correct=female   guess=male     name=Jolynn
correct=female   guess=male     name=Joyan
correct=female   guess=male     name=Kaitlynn
correct=female   guess=male     name=Karilynn
correct=female   guess=male     name=Kass
correct=female   guess=male     name=Kathlin
correct=female   guess=male     name=Kristan
correct=female   guess=male     name=Kristen
correct=female   guess=male     name=Lilian
correct=female   guess=male     name=Lurleen
correct=female   guess=male     name=Lynnett
correct=female   guess=male     name=Madlen
correct=female   guess=male     name=Margot
correct=female   guess=male     name=Margret
correct=female   guess=male     name=Mariam
correct=female   guess=male     name=Marie-Ann
correct=female   guess=male     name=Mariel
correct=female   guess=male     name=Marilyn
correct=female   guess=male     name=Maris
correct=female   guess=male     name=Mead
correct=female   guess=male     name=Meg
correct=female   guess=male     name=Megen
correct=female   guess=male     name=Meggan
correct=female   guess=male     name=Meridel
correct=female   guess=male     name=Mildred
correct=female   guess=male     name=Moll
correct=female   guess=male     name=Nell
correct=female   guess=male     name=Noellyn
correct=female   guess=male     name=Peg
correct=female   guess=male     name=Persis
correct=female   guess=male     name=Phil
correct=female   guess=male     name=Piper
correct=female   guess=male     name=Quinn
correct=female   guess=male     name=Robbin
correct=female   guess=male     name=Rosabel
correct=female   guess=male     name=Rosaleen
correct=female   guess=male     name=Rosalyn
correct=female   guess=male     name=Sal
correct=female   guess=male     name=Sara-Ann
correct=female   guess=male     name=Shannon
correct=female   guess=male     name=Sharyl
correct=female   guess=male     name=Shell
correct=female   guess=male     name=Starlin
correct=female   guess=male     name=Theo
correct=female   guess=male     name=Tiff
correct=female   guess=male     name=Vivyan
correct=female   guess=male     name=Willow
correct=female   guess=male     name=Willyt
correct=female   guess=male     name=Yehudit
correct=male     guess=female   name=Abdullah
correct=male     guess=female   name=Amory
correct=male     guess=female   name=Angie
correct=male     guess=female   name=Arne
correct=male     guess=female   name=Ash
correct=male     guess=female   name=Aube
correct=male     guess=female   name=Aubrey
correct=male     guess=female   name=Augie
correct=male     guess=female   name=Baillie
correct=male     guess=female   name=Bartholemy
correct=male     guess=female   name=Bary
correct=male     guess=female   name=Benjy
correct=male     guess=female   name=Berke
correct=male     guess=female   name=Berkley
correct=male     guess=female   name=Boniface
correct=male     guess=female   name=Boyce
correct=male     guess=female   name=Bruce
correct=male     guess=female   name=Carleigh
correct=male     guess=female   name=Chevy
correct=male     guess=female   name=Clancy
correct=male     guess=female   name=Cobbie
correct=male     guess=female   name=Cole
correct=male     guess=female   name=Constantine
correct=male     guess=female   name=Courtney
correct=male     guess=female   name=Davidde
correct=male     guess=female   name=Dudley
correct=male     guess=female   name=Duffie
correct=male     guess=female   name=Durante
correct=male     guess=female   name=Eddie
correct=male     guess=female   name=Eddy
correct=male     guess=female   name=Erny
correct=male     guess=female   name=Fairfax
correct=male     guess=female   name=Felice
correct=male     guess=female   name=Filipe
correct=male     guess=female   name=Fonsie
correct=male     guess=female   name=Freddie
correct=male     guess=female   name=Gerry
correct=male     guess=female   name=Godfrey
correct=male     guess=female   name=Guthrie
correct=male     guess=female   name=Guthry
correct=male     guess=female   name=Haleigh
correct=male     guess=female   name=Hamish
correct=male     guess=female   name=Harvey
correct=male     guess=female   name=Hersh
correct=male     guess=female   name=Hodge
correct=male     guess=female   name=Hy
correct=male     guess=female   name=Isadore
correct=male     guess=female   name=Jean-Pierre
correct=male     guess=female   name=Jeffery
correct=male     guess=female   name=Jeromy
correct=male     guess=female   name=Jesse
correct=male     guess=female   name=Johnny
correct=male     guess=female   name=Josh
correct=male     guess=female   name=Jule
correct=male     guess=female   name=Julie
correct=male     guess=female   name=Kelly
correct=male     guess=female   name=Kelsey
correct=male     guess=female   name=Kory
correct=male     guess=female   name=Lemmy
correct=male     guess=female   name=Lenny
correct=male     guess=female   name=Lesley
correct=male     guess=female   name=Luke
correct=male     guess=female   name=Martie
correct=male     guess=female   name=Marty
correct=male     guess=female   name=Max
correct=male     guess=female   name=Mika
correct=male     guess=female   name=Mischa
correct=male     guess=female   name=Mitch
correct=male     guess=female   name=Moise
correct=male     guess=female   name=Monte
correct=male     guess=female   name=Morly
correct=male     guess=female   name=Morty
correct=male     guess=female   name=Murdoch
correct=male     guess=female   name=Mustafa
correct=male     guess=female   name=Neale
correct=male     guess=female   name=Neddy
correct=male     guess=female   name=Obie
correct=male     guess=female   name=Paddie
correct=male     guess=female   name=Pascale
correct=male     guess=female   name=Pepe
correct=male     guess=female   name=Prentice
correct=male     guess=female   name=Quigly
correct=male     guess=female   name=Rabi
correct=male     guess=female   name=Rafe
correct=male     guess=female   name=Ralph
correct=male     guess=female   name=Rawley
correct=male     guess=female   name=Reese
correct=male     guess=female   name=Rey
correct=male     guess=female   name=Ronny
correct=male     guess=female   name=Rourke
correct=male     guess=female   name=Rudie
correct=male     guess=female   name=Rutledge
correct=male     guess=female   name=Say
correct=male     guess=female   name=Shaine
correct=male     guess=female   name=Stacy
correct=male     guess=female   name=Stanley
correct=male     guess=female   name=Tabbie
correct=male     guess=female   name=Tally
correct=male     guess=female   name=Tarrance
correct=male     guess=female   name=Temple
correct=male     guess=female   name=Thayne
correct=male     guess=female   name=Thorpe
correct=male     guess=female   name=Torey
correct=male     guess=female   name=Torre
correct=male     guess=female   name=Trace
correct=male     guess=female   name=Tracey
correct=male     guess=female   name=Udale
correct=male     guess=female   name=Uri
correct=male     guess=female   name=Vassily
correct=male     guess=female   name=Vijay
correct=male     guess=female   name=Vinnie
correct=male     guess=female   name=Waite
correct=male     guess=female   name=Wallache
correct=male     guess=female   name=Waverley
correct=male     guess=female   name=Westbrooke
correct=male     guess=female   name=Westleigh
correct=male     guess=female   name=Willie
correct=male     guess=female   name=Willy
correct=male     guess=female   name=Yancey
correct=male     guess=female   name=Yancy
correct=male     guess=female   name=Yehudi
correct=male     guess=female   name=Yuri
correct=male     guess=female   name=Zackariah
correct=male     guess=female   name=Zechariah
correct=male     guess=female   name=Zollie
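
To check a hypothesis such as "names ending in yn are mostly female while names ending in n are mostly male" without eyeballing the error list, one option is to tabulate gender counts per suffix. The sketch below does this on a few invented examples; the helper `suffix_gender_counts` is hypothetical and is not part of the original notebook.

```python
from collections import Counter, defaultdict

def suffix_gender_counts(labeled_names, length):
    """Map each suffix of the given length to a Counter of gender labels."""
    counts = defaultdict(Counter)
    for name, gender in labeled_names:
        counts[name[-length:].lower()][gender] += 1
    return counts

# Toy data, invented for illustration.
toy_names = [
    ("Marilyn", "female"), ("Jolyn", "female"), ("Rosalyn", "female"),
    ("John", "male"), ("Steven", "male"), ("Martin", "male"),
    ("Kevin", "male"), ("Brian", "male"),
]

by_one = suffix_gender_counts(toy_names, 1)
by_two = suffix_gender_counts(toy_names, 2)
print(dict(by_one["n"]))   # counts for all names ending in n
print(dict(by_two["yn"]))  # counts for names ending in yn only
```

On the real data, the same tabulation over `train_names` would show whether a two-letter suffix separates the genders better than the last letter alone.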

Integrating the new features into the classifier

It seems worthwhile to adjust our features by including the last two letters.

And hooray! Accuracy on the devset rose from 76% to 78.1%. Not bad, right?

[19]:
def gender_features(word):
     return {'suffix1': word[-1:],
             'suffix2': word[-2:]}

train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

# 0.76 -> 0.781
0.781

Why fresh train/dev splits matter

We can therefore repeat this process of error analysis and feature engineering until we reach satisfactory performance.

Careful! :D

It is better to make a new train/dev split every time features are added or removed, so that the features do not end up overfitting the quirks of one particular devset.
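
Here is a minimal sketch of such a re-split, assuming the test set is held fixed and only the remaining names are reshuffled into new train and dev portions at each iteration. The function `resplit_train_dev` and its parameters are hypothetical.

```python
import random

def resplit_train_dev(non_test_items, dev_size, seed):
    """Reshuffle the non-test items and return a fresh (train, dev) pair.

    A copy is shuffled so the caller's list is left untouched, and a
    dedicated random.Random instance keeps the global RNG state unchanged.
    """
    if not 0 < dev_size < len(non_test_items):
        raise ValueError("dev_size must be between 1 and len(non_test_items) - 1")
    shuffled = list(non_test_items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[dev_size:], shuffled[:dev_size]

# Placeholder data standing in for labeled_names[500:].
pool = list(range(100))
train_1, dev_1 = resplit_train_dev(pool, dev_size=20, seed=1)
train_2, dev_2 = resplit_train_dev(pool, dev_size=20, seed=2)
print(len(train_1), len(dev_1), len(train_2), len(dev_2))
```

With the notebook's data this would be called on `labeled_names[500:]`, so the 500 test names never leak into training or tuning.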

Conclusions

Congratulations on finishing the article!

Key takeaways:

  1. Traditional machine learning relies quite a bit on human analysis. As you have seen here, analyzing classification errors helps the “artificial” intelligence a lot.

  2. It is important to make a train/dev/test split to keep the model from overfitting. In the same vein, it is also advisable to keep the number of features reasonable.

  3. As you will have gathered, error analysis (feature engineering) and the trade-off between performance and generalizability make machine learning a craft whose know-how is acquired over the years.

Reference