{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parallelization in Python: a beginner’s guide (1, using map)\n", "\n", "[Xiaoou WANG](https://scholar.google.fr/citations?user=vKAMMpwAAAAJ&hl=en)\n", "\n", "Parallelization is very useful in a lot of daily tasks, however tutorials often begin with obscure explanations about multithreading and multiprocessing. Don’t get me wrong, these concepts are crucial in complex scenarios, however they are intimidating and unnecessary for beginners.\n", "\n", "Let’s use a code-first and example-driven approach to introduce parallelization in Python.\n", "\n", "Suppose you have 4 xml files zipped in .7z (a kind of compressed file) and you want to unzip them. \n", "\n", "Instinctively, you say to yourself that the best approach is to unzip all the files altogether instead of processing 1 by 1.\n", "\n", "The `altogether` way is parallelization.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check how many cores that I have on my computer\n", "\n", "Please note that I use `multiprocess` here instead of `multiprocessing` because the latter has some issue with Jupyter Notebook. See [here](https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror).\n", "\n", "We have 16 cpus :)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of processors: 16\n", "16\n" ] } ], "source": [ "import multiprocess as mp\n", "\n", "print(\"Number of processors: \", mp.cpu_count())\n", "\n", "import os\n", "\n", "workers = os.cpu_count()\n", "print(workers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the filenames and have a peek at the size of each file\n", "\n", "In our case, their extensions are `.7z` and they are located in the `/Users/xiaoou/Downloads/frwac_7z/` directory. We can use `glob` to get the list of filenames. As you can see from the output, these files are quite huge and can take some time if we unzip them one by one." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100M.7z': '144.86 mb', '/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100Mcopy.7z': '144.86 mb', '/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100Mcopy3.7z': '144.86 mb', '/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100Mcopy2.7z': '144.86 mb'}\n" ] } ], "source": [ "import glob\n", "\n", "def get_fns(dir):\n", " return glob.glob(dir)\n", "\n", "def get_size(fn, unit=\"mb\"):\n", " if unit == \"mb\":\n", " return round(os.path.getsize(fn)/(1024*1024), 2)\n", "\n", "pattern = \"/Users/xiaoou/Downloads/frwac_7z/*.7z\"\n", "fns_7z = get_fns(pattern)\n", "sizes = {x: (str(get_size(x))+\" mb\") for x in fns_7z}\n", "print(sizes)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run the test with multiple cores and 1 single core\n", "\n", "So let’s use multiprocessing. The key component here is Pool which specifies how many cores that we want to use to process files at the same time (1 core per file). See the unzip_7z function which is quite self-explainable. Here I use the map function to run the extract_7z function on each .7z file.\n", "\n", "The xo_timer is a decorator that I wrote to compute time. You can ignore it in this tutorial :)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "time by using 16 cores.\n", "Function: unzip_7z, Time: 12.408593070000052\n", "time by using 1 core.\n", "Function: unzip_7z, Time: 45.80530025600001\n" ] } ], "source": [ "import py7zr\n", "from multiprocess import Pool\n", "from frenchnlp.utils import timer\n", "\n", "@timer\n", "def unzip_7z(workers, fns):\n", " with Pool(workers) as p:\n", " p.map(extract_7z, fns) \n", " \n", "def extract_7z(fn):\n", " with py7zr.SevenZipFile(fn, mode='r') as z:\n", " z.extractall(\"/Users/xiaoou/Downloads/frwac_7z/\")\n", "\n", "# Use all the cores (16 in this case)\n", "\n", "print(f\"time by using {workers} cores.\")\n", "unzip_7z(workers, fns_7z)\n", "\n", "# use 1 core\n", "\n", "print(f\"time by using 1 core.\")\n", "unzip_7z(1, fns_7z)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When using 16 cores, the total time of unzipping these 4 files is 12.4 seconds, while using only 1 core takes 45.8 seconds!\n", "\n", "## Conclusion\n", "\n", "In this article we see how using multiple cores/cpus is possible in Python. Hopefully you see the power of parallelization and start to leverage this function in your work.\n", "Stay tuned for more tutorials on this subject :)" ] } ], "metadata": { "interpreter": { "hash": "40d3a090f54c6569ab1632332b64b2c03c39dcf918b08424e98f38b5ae0af88f" }, "kernelspec": { "display_name": "Python 3.7.6 64-bit ('base': conda)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }