Parallelization in Python: a beginner’s guide (1, using map)
Parallelization is useful in many everyday tasks, but tutorials often begin with obscure explanations of multithreading and multiprocessing. Don’t get me wrong: these concepts are crucial in complex scenarios, but they are intimidating and unnecessary for beginners.
Let’s use a code-first and example-driven approach to introduce parallelization in Python.
Suppose you have 4 XML files, each compressed as a .7z archive, and you want to extract them.
Instinctively, you tell yourself that the best approach is to extract all the files together instead of processing them one by one.
The all-together way is parallelization.
Check how many cores my computer has
Please note that I use multiprocess here instead of multiprocessing because the latter has some issues with Jupyter Notebook. See here.
We have 16 CPUs :)
[18]:
import multiprocess as mp
print("Number of processors: ", mp.cpu_count())
import os
workers = os.cpu_count()
print(workers)
Number of processors: 16
16
Get the filenames and have a peek at the size of each file
In our case, the files have the .7z extension and are located in the /Users/xiaoou/Downloads/frwac_7z/ directory. We can use glob to get the list of filenames. As the output shows, each file is about 145 MB, so extracting them one by one can take a while.
[19]:
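import glob
import os

def get_fns(pattern):
    # Return every path that matches the glob pattern
    return glob.glob(pattern)

def get_size(fn, unit="mb"):
    # File size in megabytes (1 mb = 1024 * 1024 bytes), rounded to 2 decimals
    if unit == "mb":
        return round(os.path.getsize(fn) / (1024 * 1024), 2)
    raise ValueError(f"unsupported unit: {unit}")

pattern = "/Users/xiaoou/Downloads/frwac_7z/*.7z"
fns_7z = get_fns(pattern)
sizes = {x: (str(get_size(x)) + " mb") for x in fns_7z}
print(sizes)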
{'/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100M.7z': '144.86 mb', '/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100Mcopy.7z': '144.86 mb', '/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100Mcopy3.7z': '144.86 mb', '/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100Mcopy2.7z': '144.86 mb'}
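The same listing can also be written with pathlib. This is only a sketch of an alternative, not the code used in this notebook; the helper name sizes_in_mb and its parameters are made up for illustration.

```python
from pathlib import Path


def sizes_in_mb(directory, pattern="*.7z"):
    """Map each matching file path to its size in megabytes (2 decimals).

    Hypothetical helper: `directory` and `pattern` are illustrative names.
    """
    return {
        str(path): round(path.stat().st_size / (1024 * 1024), 2)
        for path in sorted(Path(directory).glob(pattern))
    }
```

Calling sizes_in_mb("/Users/xiaoou/Downloads/frwac_7z/") should give the same sizes as above, as numbers rather than strings.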
Run the test with multiple cores and with a single core
So let’s use multiprocess. The key component here is Pool, which sets how many worker processes run at the same time (here, one file per worker). See the unzip_7z function, which is fairly self-explanatory: it uses the pool’s map method to run the extract_7z function on each .7z file.
The timer decorator (imported from frenchnlp.utils) is one that I wrote to measure run time. You can ignore it in this tutorial :)
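If you don’t have the timing decorator at hand, something like the following should behave similarly. It is a minimal sketch and not the actual frenchnlp code: the output format is only assumed from the printed results in this notebook, and slow_add is a made-up function for the demo.

```python
import functools
import time


def timer(func):
    """Print how long each call to `func` takes, then return its result."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # Assumed format, modelled on the output shown in this notebook
        print(f"Function: {func.__name__}, Time: {elapsed}")
        return result
    return wrapper


@timer
def slow_add(a, b):
    # Hypothetical example function: sleeps briefly, then adds
    time.sleep(0.1)
    return a + b
```

Calling slow_add(1, 2) prints one timing line and returns the sum as usual.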
[20]:
import py7zr
from multiprocess import Pool
from frenchnlp.utils import timer

@timer
def unzip_7z(workers, fns):
    # Start `workers` processes and hand each of them files to extract
    with Pool(workers) as p:
        p.map(extract_7z, fns)

def extract_7z(fn):
    # Extract one archive into the target directory
    with py7zr.SevenZipFile(fn, mode='r') as z:
        z.extractall("/Users/xiaoou/Downloads/frwac_7z/")

# Use all the cores (16 in this case)
print(f"time by using {workers} cores.")
unzip_7z(workers, fns_7z)

# Use 1 core
print("time by using 1 core.")
unzip_7z(1, fns_7z)
time by using 16 cores.
Function: unzip_7z, Time: 12.408593070000052
time by using 1 core.
Function: unzip_7z, Time: 45.80530025600001
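With 16 workers, extracting these 4 files takes 12.4 seconds, while a single worker takes 45.8 seconds: a speedup of about 3.7x. Since there are only 4 files, at most 4 workers are busy at once, so the speedup cannot exceed 4x here no matter how many cores are available.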
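To try the same Pool and map pattern without any archives, here is a small stand-alone sketch. It assumes you run it as a script rather than in a notebook, so it uses the standard library’s multiprocessing instead of multiprocess. The names slow_square and run are made up for the demo, and the sleep only simulates slow work.

```python
import time
from multiprocessing import Pool


def slow_square(n):
    """Stand-in for a slow task such as extracting one archive."""
    time.sleep(0.2)  # pretend to work
    return n * n


def run(workers, numbers):
    """Apply slow_square to every number using `workers` processes."""
    start = time.perf_counter()
    with Pool(workers) as pool:
        # map returns the results in the same order as the inputs
        results = pool.map(slow_square, numbers)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this part
    numbers = [1, 2, 3, 4]
    for workers in (1, 4):
        results, elapsed = run(workers, numbers)
        print(f"{workers} worker(s): {results} in {elapsed:.2f} s")
```

The run with 4 workers should finish noticeably faster than the run with 1, because the four sleeps overlap instead of happening one after another.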
Conclusion
In this article we saw how to use multiple cores/CPUs in Python. Hopefully you see the power of parallelization and will start to leverage it in your work. Stay tuned for more tutorials on this subject :)