Classifying first names by gender (male/female)¶
This very simple example, taken from Steven Bird's book, gives you an idea of what the work of a junior Machine Learning engineer looks like.
The task is to train a naive Bayes classifier to predict the gender of a first name.
Feature selection¶
We start by taking the last letter of a first name as the feature and storing it in a dictionary.
[3]:
# Use the last letter of a name as the feature
import nltk
from nltk.corpus import names
import random
random.seed(13)
def gender_features(word):
    return {'last_letter': word[-1]}
print("The last letter of the first name Shrek is")
gender_features('Shrek')
The last letter of the first name Shrek is
[3]:
{'last_letter': 'k'}
Preparing the corpus¶
The corpus comes from nltk. Here we build a list of tuples using a few methods built into nltk.corpus. Note the use of list comprehensions, which keeps the code concise while remaining readable.
[6]:
# Build the labeled dataset: one (name, gender) tuple per name
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
print("A sample of the corpus")
labeled_names[:10]
A sample of the corpus
[6]:
[('Mariam', 'female'),
('Marjorie', 'female'),
('Jasmin', 'female'),
('Welbie', 'male'),
('Modesty', 'female'),
('Kanya', 'female'),
('Michale', 'male'),
('Antonina', 'female'),
('Beulah', 'female'),
('Hazel', 'female')]
Creating the train/test sets¶
Here we apply the gender_features function from the first section to every name in the corpus. The first 500 samples are set aside to serve as the test set.
[10]:
# Build (features, label) pairs
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
[11]:
# Split into train and test sets
train_set, test_set = featuresets[500:], featuresets[:500]
First classification¶
Accuracy is about 75.2%. We list the most informative features to look for potential problems. The likelihood ratio male : female is not an exact probability: it is the ratio between the likelihood of a given last letter (the feature) under one label and under the other. For example, a final 'k' is about 46 times more likely among male names than among female names.
[12]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
# Classify two example names
classifier.classify(gender_features('Neo'))
classifier.classify(gender_features('Trinity'))
# Accuracy on the test set
print(nltk.classify.accuracy(classifier, test_set))
# Show the likelihood ratios of the most informative features
classifier.show_most_informative_features(5)
0.752
Most Informative Features
last_letter = 'k' male : female = 45.8 : 1.0
last_letter = 'a' female : male = 33.0 : 1.0
last_letter = 'f' male : female = 15.3 : 1.0
last_letter = 'p' male : female = 11.2 : 1.0
last_letter = 'v' male : female = 10.5 : 1.0
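To make the ratios in this table more concrete, here is a minimal sketch of how such a likelihood ratio can be computed from raw counts. The toy counts and the helper name likelihood_ratio are made up for illustration. The sketch uses plain relative frequencies, whereas NLTK smooths its estimates, so the numbers are not expected to match the output above.

```python
from collections import Counter

# Toy training data (hypothetical, not the NLTK names corpus):
# (last_letter, gender) pairs.
toy = ([("k", "male")] * 9 + [("k", "female")] * 1 +
       [("a", "male")] * 2 + [("a", "female")] * 18 +
       [("n", "male")] * 9 + [("n", "female")] * 11)

label_counts = Counter(label for _, label in toy)  # names per gender
pair_counts = Counter(toy)                         # names per (letter, gender)

def likelihood_ratio(letter, label_a, label_b):
    """Return P(letter | label_a) / P(letter | label_b) from raw counts."""
    p_a = pair_counts[(letter, label_a)] / label_counts[label_a]
    p_b = pair_counts[(letter, label_b)] / label_counts[label_b]
    return p_a / p_b

ratio_k = likelihood_ratio("k", "male", "female")
ratio_a = likelihood_ratio("a", "female", "male")
print("last_letter = 'k'  male : female = {:.1f} : 1.0".format(ratio_k))
print("last_letter = 'a'  female : male = {:.1f} : 1.0".format(ratio_a))
```

Note that the ratio compares how often a letter occurs within each gender, so it stays meaningful even when the two genders have different numbers of names.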
Adding features and the overfitting problem¶
If you add too many features, the model risks fitting your training data too closely and generalizing poorly to unseen data. This is called overfitting, and it often occurs when the corpus is small, which is the case here.
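As a rough illustration of overfitting, the sketch below compares accuracy on the training data with accuracy on held-out data. MemorizingClassifier and accuracy are hypothetical helpers written for this example, not part of NLTK, and the names and labels are made up. The classifier simply memorizes whole names, which is an extreme case of a model fitted too closely to its training data.

```python
class MemorizingClassifier:
    """Toy model: remembers every training name, guesses a default otherwise."""

    def __init__(self, train_pairs, default="female"):
        self.memory = dict(train_pairs)
        self.default = default

    def classify(self, name):
        return self.memory.get(name, self.default)


def accuracy(classifier, pairs):
    """Fraction of (name, label) pairs that the classifier labels correctly."""
    correct = sum(1 for name, label in pairs if classifier.classify(name) == label)
    return correct / len(pairs)


# Made-up data, for illustration only.
train_pairs = [("Neo", "male"), ("Trinity", "female"),
               ("Shrek", "male"), ("Fiona", "female")]
heldout_pairs = [("Morpheus", "male"), ("Niobe", "female"),
                 ("Cypher", "male"), ("Tank", "male")]

model = MemorizingClassifier(train_pairs)
train_acc = accuracy(model, train_pairs)
heldout_acc = accuracy(model, heldout_pairs)
print("train accuracy:", train_acc)
print("held-out accuracy:", heldout_acc)
```

A large gap between the two figures is the usual warning sign that a model has learned its training set rather than a general pattern.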
Let’s add features¶
As a first step, we will try adding a lot of features.
Let's look at the function gender_features2. The features are:
The first and the last letter
A boolean indicating whether each letter from a to z is present in the first name
An integer giving the number of occurrences of that letter
So clearly we are throwing everything at it…
[13]:
# Add features: first/last letter, plus per-letter counts and presence flags
def gender_features2(name):
    features = {"first_letter": name[0].lower(), "last_letter": name[-1].lower()}
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features
demo = gender_features2("john")
# random.sample needs a sequence, so convert the dict items to a list first
random.sample(list(demo.items()), 5)
[13]:
[('count(i)', 0),
('count(h)', 1),
('count(x)', 0),
('has(a)', False),
('has(h)', True)]
Accuracy improves¶
With this new feature set, accuracy rose from 75.2% to 77.4%.
[14]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
# 0.752 -> 0.774
0.774
Feature engineering through error analysis¶
Here we explain the feature engineering process, which consists of analyzing the machine's errors and then filtering, removing, or creating features on that basis.
Artificial intelligence turns out not to be so artificial after all, does it? (OK, I admit this is not deep learning, but still.)
Note the creation of a 'devset' here. This three-way split is standard in Machine Learning. The training set is used to train the model and the dev set to tune it. The test set should only be used for the final evaluation.
The initial accuracy on the dev set (before feature engineering) is therefore 76%.
[15]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]
[16]:
# Build the train, dev-test and test feature sets
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))
0.76
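The three-way split used above can also be wrapped in a small reusable helper. This is only a sketch: the function name three_way_split, its parameters, and the placeholder items are assumptions made for the example, not part of NLTK or of the original notebook.

```python
import random

def three_way_split(items, n_test, n_dev, seed=13):
    """Shuffle a copy of items and slice it into train, dev and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

# Stand-in data: 20 placeholder items instead of the real labeled names.
items = ["name{}".format(i) for i in range(20)]
train, dev, test = three_way_split(items, n_test=4, n_dev=6)
print(len(train), len(dev), len(test))
```

Shuffling a copy with a dedicated random.Random instance leaves the caller's list and the global random state untouched, which makes the split easier to reproduce.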
First hypotheses about the errors¶
Let's analyze the errors displayed below.
First names ending in yn tend to be female, whereas those ending in n tend to be male. So two rules would be better than one.
The same seems to hold for first names ending in h, which are mostly female, and those ending in ch, which tend to be male.
[17]:
# Collect every dev-test name that the classifier gets wrong
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))
correct=female guess=male name=Aeriel
correct=female guess=male name=Aeriell
correct=female guess=male name=Allis
correct=female guess=male name=Allsun
correct=female guess=male name=Allyn
correct=female guess=male name=Allys
correct=female guess=male name=Amargo
correct=female guess=male name=Amber
correct=female guess=male name=Anne-Mar
correct=female guess=male name=Aurel
correct=female guess=male name=Avril
correct=female guess=male name=Barb
correct=female guess=male name=Beatriz
correct=female guess=male name=Beilul
correct=female guess=male name=Calypso
correct=female guess=male name=Cameo
correct=female guess=male name=Carlin
correct=female guess=male name=Carol
correct=female guess=male name=Carol-Jean
correct=female guess=male name=Caron
correct=female guess=male name=Caryl
correct=female guess=male name=Cat
correct=female guess=male name=Ceil
correct=female guess=male name=Charin
correct=female guess=male name=Charleen
correct=female guess=male name=Charlott
correct=female guess=male name=Charmian
correct=female guess=male name=Charo
correct=female guess=male name=Christal
correct=female guess=male name=Christel
correct=female guess=male name=Cleo
correct=female guess=male name=Corliss
correct=female guess=male name=Cris
correct=female guess=male name=Cristabel
correct=female guess=male name=Cybill
correct=female guess=male name=Dael
correct=female guess=male name=Daloris
correct=female guess=male name=Darb
correct=female guess=male name=Del
correct=female guess=male name=Delores
correct=female guess=male name=Dian
correct=female guess=male name=Doloritas
correct=female guess=male name=Dorcas
correct=female guess=male name=Doreen
correct=female guess=male name=Dorian
correct=female guess=male name=Estell
correct=female guess=male name=Esther
correct=female guess=male name=Felicdad
correct=female guess=male name=Gael
correct=female guess=male name=Gilligan
correct=female guess=male name=Gladys
correct=female guess=male name=Glen
correct=female guess=male name=Glynis
correct=female guess=male name=Greer
correct=female guess=male name=Grissel
correct=female guess=male name=Heather
correct=female guess=male name=Helen
correct=female guess=male name=Hildegaard
correct=female guess=male name=Ingeborg
correct=female guess=male name=Iseabal
correct=female guess=male name=Jaclin
correct=female guess=male name=Janel
correct=female guess=male name=Jen
correct=female guess=male name=Jenifer
correct=female guess=male name=Jo-Ann
correct=female guess=male name=Jolyn
correct=female guess=male name=Jolynn
correct=female guess=male name=Joyan
correct=female guess=male name=Kaitlynn
correct=female guess=male name=Karilynn
correct=female guess=male name=Kass
correct=female guess=male name=Kathlin
correct=female guess=male name=Kristan
correct=female guess=male name=Kristen
correct=female guess=male name=Lilian
correct=female guess=male name=Lurleen
correct=female guess=male name=Lynnett
correct=female guess=male name=Madlen
correct=female guess=male name=Margot
correct=female guess=male name=Margret
correct=female guess=male name=Mariam
correct=female guess=male name=Marie-Ann
correct=female guess=male name=Mariel
correct=female guess=male name=Marilyn
correct=female guess=male name=Maris
correct=female guess=male name=Mead
correct=female guess=male name=Meg
correct=female guess=male name=Megen
correct=female guess=male name=Meggan
correct=female guess=male name=Meridel
correct=female guess=male name=Mildred
correct=female guess=male name=Moll
correct=female guess=male name=Nell
correct=female guess=male name=Noellyn
correct=female guess=male name=Peg
correct=female guess=male name=Persis
correct=female guess=male name=Phil
correct=female guess=male name=Piper
correct=female guess=male name=Quinn
correct=female guess=male name=Robbin
correct=female guess=male name=Rosabel
correct=female guess=male name=Rosaleen
correct=female guess=male name=Rosalyn
correct=female guess=male name=Sal
correct=female guess=male name=Sara-Ann
correct=female guess=male name=Shannon
correct=female guess=male name=Sharyl
correct=female guess=male name=Shell
correct=female guess=male name=Starlin
correct=female guess=male name=Theo
correct=female guess=male name=Tiff
correct=female guess=male name=Vivyan
correct=female guess=male name=Willow
correct=female guess=male name=Willyt
correct=female guess=male name=Yehudit
correct=male guess=female name=Abdullah
correct=male guess=female name=Amory
correct=male guess=female name=Angie
correct=male guess=female name=Arne
correct=male guess=female name=Ash
correct=male guess=female name=Aube
correct=male guess=female name=Aubrey
correct=male guess=female name=Augie
correct=male guess=female name=Baillie
correct=male guess=female name=Bartholemy
correct=male guess=female name=Bary
correct=male guess=female name=Benjy
correct=male guess=female name=Berke
correct=male guess=female name=Berkley
correct=male guess=female name=Boniface
correct=male guess=female name=Boyce
correct=male guess=female name=Bruce
correct=male guess=female name=Carleigh
correct=male guess=female name=Chevy
correct=male guess=female name=Clancy
correct=male guess=female name=Cobbie
correct=male guess=female name=Cole
correct=male guess=female name=Constantine
correct=male guess=female name=Courtney
correct=male guess=female name=Davidde
correct=male guess=female name=Dudley
correct=male guess=female name=Duffie
correct=male guess=female name=Durante
correct=male guess=female name=Eddie
correct=male guess=female name=Eddy
correct=male guess=female name=Erny
correct=male guess=female name=Fairfax
correct=male guess=female name=Felice
correct=male guess=female name=Filipe
correct=male guess=female name=Fonsie
correct=male guess=female name=Freddie
correct=male guess=female name=Gerry
correct=male guess=female name=Godfrey
correct=male guess=female name=Guthrie
correct=male guess=female name=Guthry
correct=male guess=female name=Haleigh
correct=male guess=female name=Hamish
correct=male guess=female name=Harvey
correct=male guess=female name=Hersh
correct=male guess=female name=Hodge
correct=male guess=female name=Hy
correct=male guess=female name=Isadore
correct=male guess=female name=Jean-Pierre
correct=male guess=female name=Jeffery
correct=male guess=female name=Jeromy
correct=male guess=female name=Jesse
correct=male guess=female name=Johnny
correct=male guess=female name=Josh
correct=male guess=female name=Jule
correct=male guess=female name=Julie
correct=male guess=female name=Kelly
correct=male guess=female name=Kelsey
correct=male guess=female name=Kory
correct=male guess=female name=Lemmy
correct=male guess=female name=Lenny
correct=male guess=female name=Lesley
correct=male guess=female name=Luke
correct=male guess=female name=Martie
correct=male guess=female name=Marty
correct=male guess=female name=Max
correct=male guess=female name=Mika
correct=male guess=female name=Mischa
correct=male guess=female name=Mitch
correct=male guess=female name=Moise
correct=male guess=female name=Monte
correct=male guess=female name=Morly
correct=male guess=female name=Morty
correct=male guess=female name=Murdoch
correct=male guess=female name=Mustafa
correct=male guess=female name=Neale
correct=male guess=female name=Neddy
correct=male guess=female name=Obie
correct=male guess=female name=Paddie
correct=male guess=female name=Pascale
correct=male guess=female name=Pepe
correct=male guess=female name=Prentice
correct=male guess=female name=Quigly
correct=male guess=female name=Rabi
correct=male guess=female name=Rafe
correct=male guess=female name=Ralph
correct=male guess=female name=Rawley
correct=male guess=female name=Reese
correct=male guess=female name=Rey
correct=male guess=female name=Ronny
correct=male guess=female name=Rourke
correct=male guess=female name=Rudie
correct=male guess=female name=Rutledge
correct=male guess=female name=Say
correct=male guess=female name=Shaine
correct=male guess=female name=Stacy
correct=male guess=female name=Stanley
correct=male guess=female name=Tabbie
correct=male guess=female name=Tally
correct=male guess=female name=Tarrance
correct=male guess=female name=Temple
correct=male guess=female name=Thayne
correct=male guess=female name=Thorpe
correct=male guess=female name=Torey
correct=male guess=female name=Torre
correct=male guess=female name=Trace
correct=male guess=female name=Tracey
correct=male guess=female name=Udale
correct=male guess=female name=Uri
correct=male guess=female name=Vassily
correct=male guess=female name=Vijay
correct=male guess=female name=Vinnie
correct=male guess=female name=Waite
correct=male guess=female name=Wallache
correct=male guess=female name=Waverley
correct=male guess=female name=Westbrooke
correct=male guess=female name=Westleigh
correct=male guess=female name=Willie
correct=male guess=female name=Willy
correct=male guess=female name=Yancey
correct=male guess=female name=Yancy
correct=male guess=female name=Yehudi
correct=male guess=female name=Yuri
correct=male guess=female name=Zackariah
correct=male guess=female name=Zechariah
correct=male guess=female name=Zollie
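One way to check suffix hypotheses like these, rather than reading the error list by eye, is to tally labels per suffix. The sketch below is an assumption-laden illustration: tally_by_suffix is a hypothetical helper, and the small sample with its labels is hand-made, not drawn from the corpus.

```python
from collections import Counter, defaultdict

def tally_by_suffix(labeled_names, length):
    """Count labels for each lowercase suffix of the given length."""
    counts = defaultdict(Counter)
    for name, label in labeled_names:
        counts[name[-length:].lower()][label] += 1
    return counts

# Small hand-made sample (illustrative only, labels are assumptions).
sample = [("Marilyn", "female"), ("Rosalyn", "female"), ("Jolyn", "female"),
          ("Martin", "male"), ("Justin", "male"), ("Calvin", "male"),
          ("Ethan", "male"), ("Mitch", "male"), ("Murdoch", "male"),
          ("Beulah", "female"), ("Sarah", "female"), ("Hannah", "female")]

by_one = tally_by_suffix(sample, 1)
by_two = tally_by_suffix(sample, 2)
print("n :", dict(by_one["n"]))
print("yn:", dict(by_two["yn"]))
print("h :", dict(by_one["h"]))
print("ch:", dict(by_two["ch"]))
```

On real data, the same tally over the training names would show whether a two-letter suffix separates the genders better than the last letter alone.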
Integrating the new features into the classifier¶
It seems worthwhile to adjust our features to include the last two letters.
And hooray! Accuracy rose from 76% to 78.1%. Not bad, right?
[19]:
# Redefine the feature extractor: last letter and last two letters
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))
# 0.76 -> 0.781
0.781
The importance of fresh train/dev splits¶
We can therefore repeat this cycle of error analysis and feature engineering until performance is satisfactory.
Careful! :D
It is better to make a new train/dev split each time features are added or removed, to avoid overfitting to the dev set.
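A minimal sketch of what re-splitting at each iteration could look like, assuming the test set is fixed once and only the remaining pool is reshuffled. The helper resplit_train_dev, the per-round seed, and the placeholder items are all assumptions made for the example.

```python
import random

def resplit_train_dev(pool, n_dev, seed):
    """Reshuffle the non-test pool and cut a new dev set from it."""
    shuffled = list(pool)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_dev:], shuffled[:n_dev]

# Placeholder items standing in for the labeled names.
items = ["name{}".format(i) for i in range(30)]
test_items, pool = items[:5], items[5:]  # the test set is set aside once

for round_number in range(3):
    # A different seed per round gives a different train/dev partition.
    train_items, dev_items = resplit_train_dev(pool, n_dev=10, seed=round_number)
    # ... change the features, retrain, and evaluate on dev_items here ...
    print(round_number, len(train_items), len(dev_items))
```

Keeping the test items out of the pool means they are never seen during any feature engineering round.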
Conclusions¶
Well done for making it to the end of the article!
Key takeaways:
Traditional machine learning relies quite heavily on human analysis. As you have seen here, analyzing classification errors helps "artificial" intelligence a great deal.
It is important to use a train/dev/test split to keep the model from overfitting. Along the same lines, it is also advisable to keep the number of features reasonable.
As you will have gathered, error analysis (feature engineering) and the trade-off between performance and generalizability make machine learning an art whose know-how is acquired over the years.