{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Everything is translation (Build a chatbot using attention and self-attention in fairseq)\n", "\n", "[Xiaoou Wang](https://scholar.google.fr/citations?user=vKAMMpwAAAAJ&hl=en)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build a simple chatbot using NMT model\n", "\n", "Chatbots are quite popular these days. Modern models can even tell corny jokes (like [Google's Meena](https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html)).\n", "\n", "If you followed [the tutorial on build Neural translation model](./06_machine_translation), you have actually learned the basics needed for building a chatbot. Surprising? Maybe not :) Actually, in a chatbot setting, it's common to consider the question as the text to be \"translated\" and the answer as an translation.\n", "\n", "The bitext is thus a file consisting of lines separated often by `\\t`: the first item being the question and the last item the response." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the sake of simplicity, I'm not going to review the essential concepts again, do read the previous tutorial and come back here if there are some terms appearing cryptic to you.\n", "\n", "Just like NMT task, it's common to have two files, one for the questions and another for the answers. If we use the translation terminology, the questions are `fr` (foreign) and the answers are `en` (English, since early translation systems mostly translate to English)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hey .\n", "Hi .\n", "Are you a fan of the Star Wars series ?\n", "Yeah love them .\n", "Me too .\n", "Cool .\n", "Who is your least favorite character ?\n", "Without a doubt Jar Jar Binks .\n", "Oh man , same here .\n", "I like this new trend of taking animated Disney movies and making them into live action movies .\n" ] } ], "source": [ "!head data/dialog/selfdialog.valid.tok.fr" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hi .\n", "Are you a fan of the Star Wars series ?\n", "Yeah love them .\n", "Me too .\n", "Cool .\n", "Who is your least favorite character ?\n", "Without a doubt Jar Jar Binks .\n", "Oh man , same here .\n", "Absolutely hate him .\n", "Me , too . I trust Disney to do it right .\n" ] } ], "source": [ "!head data/dialog/selfdialog.valid.tok.en" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The preprocessing step (convert raw texts to a binary format) is identical to the NMT task:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fairseq-preprocess \\\n", "--source-lang fr \\\n", "--target-lang en \\\n", "--trainpref data/chatbot/selfdialog.train.tok \\\n", "--validpref data/chatbot/selfdialog.valid.tok \\\n", "--destdir data/chatbot-cp \\\n", "--thresholdsrc 3 \\\n", "--thresholdtgt 3" ] } ], "source": [ "!cat ./src/dialog/preprocess.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training part is very similar too. It is worth noting that although the building block used here is `lstm`, the whole architecture actually comprises an attention mechnism." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fairseq-train \\ data/chatbot-bin \\\n", "--arch lstm \\\n", "--share-decoder-input-output-embed \\\n", "--optimizer adam \\\n", "--lr 1.0e-3 \\\n", "--max-tokens 4096 \\\n", "--save-dir data/chatbot-ckpt" ] } ], "source": [ "!head src/dialog/train.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The inference part is the same. Here we used a beam width of 5 (if 1 then it is equivalent to greedy search)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fairseq-interactive \\ data/chatbot-bin \\\n", "--path data/chatbot-ckpt/checkpoint_best.pt \\\n", "--beam 5 \\\n", "--source-lang fr \\\n", "--target-lang en" ] } ], "source": [ "!head src/dialog/inference.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Et voilà! You built your first dialogue system :)\n", "![](img/07_chatbot/2022-03-07-13-27-55.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## So what is attention\n", "\n", "Remember in [the previous tutorial](./06_machine_translation), we briefly talked about the main drawback of a vanilla seq2seq2 model. The input (a sentence or a text) is encoded into a single vector regardless of its length, resulting in a loss of information.\n", "\n", "If we think about how a human translates a sentence, it's often the case that human translators don't remember the sentence and use a static internal representation to produce translations. Actually, he will often look back to the original text in order to focus his attention on the relevant parts at each time step.\n", "\n", "Suppose the text to translate is \"Le chat cherche de la nourriture désespérément\" in French, a human translator would start by translating \"the cat\", focusing his attention on \"Le chat\". Then he will maybe put his focus on the word \"désespérément\" and output \"desparately\". Finally he will shift his attention back to \"cherche de la nourriture\" meaning \"look for something to eat\".\n", "\n", "The approach of searching for the relevant part at each time step is not possible using the single vector encoding. So instead, Bahdanau et al. comes up with a clever solution which consists in taking all the hidden states of a RNN structure. The input now is a list of vectors (each vector represents a token). Besides, at each time step of the decoder (translation), a context vector is compared against the vectors from the encoder to detect the most useful tokens to pay attention to. So for each time step of translation, the decoder has access to a different, weighted representation of the input text!\n", "\n", "For more details, see\n", "\n", "Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate. ArXiv:1409.0473 [Cs, Stat]. http://arxiv.org/abs/1409.0473" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## And the self-attention ?\n", "\n", "The attention mechanism is designed to tackle the issue of the single vector representation. However, it is still facing another problem: long-range dependence. As a sentence's length increases, it would be more and more difficult to \"remember\" relations between different elements inside a sentence.\n", "\n", "For instance, in the sentence \"The neighbor is not here today because of the vacation, that's why his cat keeps meowing.\". It is very difficult for a RNN unit to know that `his` is related to `the neighbor` since too many words lie between these two phrases. However, a simple rule like `his` is always related to the nearest human pronoun should work.\n", "\n", "In another words, this kind of `random access` is more efficient than the `linear access` way of RNNs.\n", "\n", "On a higher level, one could say that the context vector is the inputs themselves. So this form of attention is called `self-attention`, the use of the latter in encoder and decoder has led to a new neural network architecture called transformer. The use of transformer and transfer learning in a variety of nlp tasks has enabled researchers to achieve new scores on leadboards.\n", "\n", "In order to understand the implementation of `self-attention` (for example the exact architecture comprises multi-head attention aiming at capture richer linguistic representation). It may be useful to consult the two following sources:\n", "\n", "[The illustrated transformer](http://jalammar.github.io/illustrated-transformer/)\n", "\n", "[The annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to use self-attention in fairseq?\n", "\n", "The nice thing about `fairseq` is that it implements and provides already many sota model architectures including transformer. Also fairseq makes it really smooth to use multiple GPUs.\n", "\n", "To use the transformer architecture, just change the `--arch lstm` to `--arch transformer`. You can ignore the hyperparameters because the choice of the latters is irrelevant here.\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fairseq-train \\\n", "data/chatbot-bin \\\n", "--arch transformer \\\n", "--share-decoder-input-output-embed \\\n", "--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \\\n", "--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \\\n", "--dropout 0.3 --weight-decay 0.0 \\\n", "--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \\\n", "--max-tokens 4096 \\\n", "--save-dir data/chatbot-cp-transformer" ] } ], "source": [ "!head src/dialog/train_transformer.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions\n", "\n", "That's it :) You know how to build a simple chatbot right now. Next time we will talk about how to use NMT models to implement Grammatical Error Correction (GEC) and especially, how the whole architecture can be used to produce `noisy texts` (texts containing errors).\n", "\n", "Stay tuned :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "interpreter": { "hash": "40d3a090f54c6569ab1632332b64b2c03c39dcf918b08424e98f38b5ae0af88f" }, "kernelspec": { "display_name": "Python 3.9.7 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }