Parallelization in Python: a beginner’s guide (1, using map)

Xiaoou WANG

Parallelization is useful in many everyday tasks, yet tutorials often open with obscure explanations of multithreading and multiprocessing. Don’t get me wrong: these concepts are crucial in complex scenarios, but they are intimidating and unnecessary for beginners.

Let’s use a code-first and example-driven approach to introduce parallelization in Python.

Suppose you have 4 XML files, each compressed as a .7z archive, and you want to unzip them.

Instinctively, you tell yourself that the best approach is to unzip all the files at the same time instead of processing them one by one.

Doing them all at the same time is parallelization.
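As a first taste, here is a minimal sketch of the two approaches on a toy function instead of real archives. The names square and numbers are made up for illustration. The sketch assumes it is run as a standalone script, which is why it uses the standard-library multiprocessing module rather than the multiprocess package used in the notebook cells.

```python
from multiprocessing import Pool


def square(x):
    # Toy stand-in for a slow task such as unzipping one file.
    return x * x


if __name__ == "__main__":
    numbers = [1, 2, 3, 4]

    # One by one: the built-in map calls square on each item in turn.
    one_by_one = list(map(square, numbers))

    # All together: Pool.map spreads the same calls over 4 worker processes.
    with Pool(4) as pool:
        all_together = pool.map(square, numbers)

    # Both approaches return the results in the same order.
    print(one_by_one)
    print(all_together)
```

Both lines print [1, 4, 9, 16]. The only change between the two approaches is which map is called, and that is the whole idea behind this tutorial.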

Check how many cores my computer has

Please note that I use the multiprocess package here instead of the standard-library multiprocessing module, because the latter has some issues with Jupyter Notebook. See here.

We have 16 CPUs :)

[18]:

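import os

import multiprocess as mp

# Both calls report the number of logical CPUs visible to Python.
print("Number of processors: ", mp.cpu_count())

# os.cpu_count() can return None if the count cannot be determined.
workers = os.cpu_count()
print(workers)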

Number of processors:  16
16

Get the filenames and have a peek at the size of each file

In our case, the files have the .7z extension and are located in the /Users/xiaoou/Downloads/frwac_7z/ directory. We can use glob to get the list of filenames. As the output shows, each file is about 145 MB, so unzipping them one by one can take some time.

[19]:

import glob

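def get_fns(pattern):
    # Return every path that matches the glob pattern.
    return glob.glob(pattern)

def get_size(fn, unit="mb"):
    # Return the file size in megabytes (1024 * 1024 bytes), rounded to 2 decimals.
    if unit == "mb":
        return round(os.path.getsize(fn) / (1024 * 1024), 2)
    # Fail loudly instead of silently returning None for other units.
    raise ValueError(f"unsupported unit: {unit}")

pattern = "/Users/xiaoou/Downloads/frwac_7z/*.7z"
fns_7z = get_fns(pattern)
sizes = {x: f"{get_size(x)} mb" for x in fns_7z}
print(sizes)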

{'/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100M.7z': '144.86 mb', '/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100Mcopy.7z': '144.86 mb', '/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100Mcopy3.7z': '144.86 mb', '/Users/xiaoou/Downloads/frwac_7z/frwac_subset_100Mcopy2.7z': '144.86 mb'}

Run the test with multiple cores and with a single core

So let’s use multiprocess. The key component here is Pool, which specifies how many worker processes we want to use to process files at the same time (ideally 1 core per file). See the unzip_7z function, which is quite self-explanatory. Here I use the pool’s map method to run the extract_7z function on each .7z file.

The timer is a decorator that I wrote to measure execution time. You can ignore it in this tutorial :)
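If you do not have frenchnlp installed, a stand-in decorator could look like the sketch below. It only mimics the "Function: ..., Time: ..." lines that appear in the output of this notebook, and is not necessarily how frenchnlp.utils.timer is written. The add function is a hypothetical example, there only to show the decorator in use.

```python
import functools
import time


def timer(func):
    """Print how long each call to func takes, then return its result."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # Mimics the "Function: <name>, Time: <seconds>" lines in the output.
        print(f"Function: {func.__name__}, Time: {elapsed}")
        return result

    return wrapper


@timer
def add(a, b):
    # Hypothetical example function, only here to show the decorator in use.
    return a + b


print(add(1, 2))
```

Each call to add prints one timing line before the result is returned, and functools.wraps keeps the original function name so the timing line stays readable.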

[20]:

import py7zr
from multiprocess import Pool
from frenchnlp.utils import timer

def extract_7z(fn):
    # Extract one archive into the target directory.
    with py7zr.SevenZipFile(fn, mode='r') as z:
        z.extractall("/Users/xiaoou/Downloads/frwac_7z/")

@timer
def unzip_7z(workers, fns):
    # Start `workers` processes and hand each archive to one of them.
    with Pool(workers) as p:
        p.map(extract_7z, fns)

# Use all the cores (16 in this case)
print(f"time by using {workers} cores.")
unzip_7z(workers, fns_7z)

# Use 1 core
print("time by using 1 core.")
unzip_7z(1, fns_7z)

time by using 16 cores.
Function: unzip_7z, Time: 12.408593070000052
time by using 1 core.
Function: unzip_7z, Time: 45.80530025600001

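With a pool of 16 workers, unzipping these 4 files takes 12.4 seconds in total, while a single worker takes 45.8 seconds! That is roughly 3.7 times faster. Note that with only 4 files, at most 4 of the 16 workers have anything to do.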
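To put a number on the gain, the sketch below computes the speedup from the two timings in the output. The helper names speedup and busy_workers are made up for this example. busy_workers assumes each file is one task, so a pool can never keep more workers busy than there are files.

```python
def speedup(serial_seconds, parallel_seconds):
    # How many times faster the parallel run was.
    return serial_seconds / parallel_seconds


def busy_workers(n_workers, n_files):
    # With one task per file, workers beyond n_files stay idle.
    return min(n_workers, n_files)


# Timings copied from the notebook output.
s = speedup(45.80530025600001, 12.408593070000052)
n = busy_workers(16, 4)

print(f"speedup: {s:.2f}x")
print(f"busy workers: {n}")
print(f"efficiency per busy worker: {s / n:.0%}")
```

With these timings the speedup is about 3.69x across 4 busy workers, or roughly 92% efficiency per worker. A pool of 4 would therefore likely have given a similar time to a pool of 16 here.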

Conclusion

In this article we saw how to use multiple cores/CPUs in Python. Hopefully you see the power of parallelization and will start to leverage it in your work. Stay tuned for more tutorials on this subject :)