Complete tutorial on scraping French news from Le Monde 🇬🇧

Xiaoou WANG

Why you need web scraping and why I wrote this tutorial

Sooner or later in your career you will find yourself in a situation where you need to build your own corpus for common NLP tasks.

A few days ago I wrote a tutorial on scraping news from the French newspaper Le Monde. I didn't finish it, and today, while refreshing some web scraping skills, I realized that the best way to keep a personal knowledge base is to write complete, well-documented tutorials. It produces more robust and contextualized documentation (like BERT :D) than simply commenting code.

Get article links

The first thing to do is to generate archive links. An archive link is a page from which we can scrape article links; for instance, https://www.lemonde.fr/archives-du-monde/01-01-2020/ lists all the articles published on 2020-01-01. Here I create the function create_archive_links, which takes starting and ending year/month/day as input and returns a dictionary in the form year: links.

def create_archive_links(year_start, year_end, month_start, month_end, day_start, day_end):
    """Build archive page URLs (day-month-year format), grouped by year."""
    archive_links = {}
    for y in range(year_start, year_end + 1):
        dates = [str(d).zfill(2) + "-" + str(m).zfill(2) + "-" + str(y)
                 for m in range(month_start, month_end + 1)
                 for d in range(day_start, day_end + 1)]
        archive_links[y] = [
            "https://www.lemonde.fr/archives-du-monde/" + date + "/" for date in dates]
    return archive_links

Example of calling create_archive_links(2006, 2020, 1, 12, 1, 31) and inspecting the output (this also creates the archive_links dictionary used in the loops below):
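# every day from 2006 to 2020; months 1-12, days 1-31
archive_links = create_archive_links(2006, 2020, 1, 12, 1, 31)

print(len(archive_links[2006]))  # 372 candidate dates (12 months x 31 days)
print(archive_links[2006][0])    # https://www.lemonde.fr/archives-du-monde/01-01-2006/

Non-existing dates such as 31-02-2006 are generated too; they are dealt with in the next step.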

The next step is to get all the article links on each archive page. For this you need three imports: HTTPError to handle exceptions, urlopen to open web pages and BeautifulSoup to parse them.

The magic power of exception handling

Exception handling is necessary because we generate archive pages for dates like 02-31; it is much easier to handle the resulting errors than to generate only links corresponding to existing dates. Each article sits in a <section> element with the class teaser.

I also filter out all the non-free articles, which contain a span with the class icon__premium. Links containing the word en-direct are filtered out as well because they point to videos. This is to say that web scraping requires not only programming skills but also some elementary inspection of the target pages.

from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_articles_links(archive_links):
    links_non_abonne = []
    for link in archive_links:
        try:
            html = urlopen(link)
        except HTTPError:
            # non-existing dates (e.g. 02-31) raise a 404
            print("url not valid", link)
        else:
            soup = BeautifulSoup(html, "html.parser")
            news = soup.find_all(class_="teaser")
            for item in news:
                # keep only free articles: no span with class icon__premium (abonnés)
                if not item.find('span', {'class': 'icon__premium'}):
                    l_article = item.find('a')['href']
                    # links containing en-direct point to videos
                    if 'en-direct' not in l_article:
                        links_non_abonne.append(l_article)
    return links_non_abonne
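Before looping over every year, you can sanity-check the function on a single year (this uses the archive_links dictionary created above):

links_2020 = get_articles_links(archive_links[2020])
print(len(links_2020), "free article links found for 2020")
print(links_2020[:3])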

You don’t want to scrape more than once

Since nobody wants to scrape the same links again and again (they are not very likely to change), a handy function can be created to save the links to files. Here the publication year is used to name the files.

import os

def write_links(path, links, year_fn):
    # one file per year, e.g. lemonde_2006_links.txt
    with open(os.path.join(path, "lemonde_" + str(year_fn) + "_links.txt"), 'w') as f:
        for link in links:
            f.write(link + "\n")

article_links = {}

for year, links in archive_links.items():
    print("processing:", year)
    article_links_list = get_articles_links(links)
    article_links[year] = article_links_list
    # corpus_path is the folder where the link files are saved
    write_links(corpus_path, article_links_list, year)

This produces one text file of links per year, e.g. lemonde_2006_links.txt.
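When you come back to the project later, you can reload those files instead of hitting the archive pages again. The read_links function below is a small hypothetical helper, not part of the original code:

def read_links(path, year_fn):
    # hypothetical helper: reload the links saved by write_links for a given year
    with open(os.path.join(path, "lemonde_" + str(year_fn) + "_links.txt")) as f:
        return [line.strip() for line in f if line.strip()]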

Let’s scrape

Now you can scrape the article contents, which is surprisingly straightforward: you just need to get the h1, h2 and p elements. The recursive=False is important here because you only want the direct children of the <article> element; there is no need to dig any deeper once the text is found at the first level.

For modularity, you should first write a function that scrapes a single page.

def get_single_page(url):
    try:
        html = urlopen(url)
    except HTTPError:
        print("url not valid", url)
    else:
        soup = BeautifulSoup(html, "html.parser")
        text_title = soup.find('h1')
        # only the direct children of <article>, hence recursive=False
        text_body = soup.article.find_all(["p", "h2"], recursive=False)
        return (text_title, text_body)
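A quick check on one of the links collected earlier (the function returns None when the page cannot be opened, hence the test):

page = get_single_page(article_links[2020][0])
if page is not None:
    title, body = page
    print(title.get_text())
    print(len(body), "text elements found")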

Let's say you need to classify news by theme (a very common text classification task). You can use the following function to extract the theme from a link. For example, the link https://www.lemonde.fr/politique/article/2020/01/01/reforme-des-retraites-macron-reste-inflexible-aucune-issue-ne-se-profile_6024550_823448.html contains the keyword politique, meaning politics in French.

import re

def extract_theme(link):
    try:
        # the theme is the first path segment after the domain, e.g. ".fr/politique/"
        theme_text = re.findall(r'\.fr/.*?/', link)[0]
    except IndexError:
        return None
    else:
        return theme_text[4:-1]
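For instance, on the politique link above and on a URL that does not match the pattern:

link = "https://www.lemonde.fr/politique/article/2020/01/01/reforme-des-retraites-macron-reste-inflexible-aucune-issue-ne-se-profile_6024550_823448.html"
print(extract_theme(link))                   # politique
print(extract_theme("https://example.com"))  # None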

You can then count the themes and keep only the most frequent ones. Here a theme is kept if it appears more than 700 times, and the top 5 are selected.

from collections import Counter

def list_themes(links):
    themes = []
    for link in links:
        theme = extract_theme(link)
        if theme is not None:
            themes.append(theme)
    return themes

# gather every article link scraped so far, across all years
all_links = []
for link_list in article_links.values():
    all_links.extend(link_list)

themes = list_themes(all_links)
theme_stat = Counter(themes)
theme_top = []
for k, v in sorted(theme_stat.items(), key=lambda x: x[1], reverse=True):
    if v > 700:
        theme_top.append((k, v))
themes_top_five = [x[0] for x in theme_top[:5]]

Now you need the links corresponding to the top 5 themes, so:

from collections import defaultdict
def classify_links(theme_list, link_list):
    dict_links = defaultdict(list)
    for theme in theme_list:
        theme_link = 'https://www.lemonde.fr/' + theme + '/article/'
        for link in link_list:
            if theme_link in link:
                dict_links[theme].append(link)
    return dict_links
themes_top_five_links = classify_links(themes_top_five, all_links)

It's time to scrape all the articles you need. Typically you would want to save all the scraped text in a folder. Note that the scraping function takes a dictionary with themes as keys and the corresponding links as values; for example, here is such a dictionary for the top 5 themes.

links_dict = {key: value for key, value in themes_top_five_links.items()}

Finally, you can scrape! Note the tqdm around range: since the whole scraping process can take quite a while, you want to keep track of the progress so that you know it is still running. See how to use tqdm at https://tqdm.github.io/. Below is a minimal progress bar example along the lines of the official documentation.
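from time import sleep
from tqdm import tqdm

# wrap any iterable in tqdm() to get a live progress bar
for i in tqdm(range(100)):
    sleep(0.01)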

import os
from tqdm import tqdm

def scrape_articles(dict_links):
    themes = dict_links.keys()
    for theme in themes:
        # create_folder and extract_fn are small helpers, sketched further below
        create_folder(os.path.join('corpus', theme))
        print("processing:", theme)
        # note the use of tqdm to display a progress bar over the current theme
        for i in tqdm(range(len(dict_links[theme]))):
            link = dict_links[theme][i]
            fn = extract_fn(link)
            single_page = get_single_page(link)
            if single_page is not None:
                with open(os.path.join('corpus', theme, fn + '.txt'), 'w') as f:
                    # f.write(dict_links[theme][i] + "\n" * 2)
                    f.write(single_page[0].get_text() + "\n" * 2)
                    for line in single_page[1]:
                        f.write(line.get_text() + "\n" * 2)
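The helpers create_folder and extract_fn are not shown in this post (they live in the source code). Here is a minimal sketch of what they could look like, assuming extract_fn simply keeps the article slug at the end of the URL:

def create_folder(path):
    # create the target folder if it does not exist yet
    os.makedirs(path, exist_ok=True)

def extract_fn(link):
    # hypothetical version: use the URL's last segment (the article slug) as file name
    return link.rstrip('/').split('/')[-1].replace('.html', '')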

Now that you have all the articles as txt files, here is a function that loads up to a given number of texts from each theme folder. The default is set to 1,000 here.

def cr_corpus_dict(path_corpus, n_files=1000):
    dict_corpus = defaultdict(list)
    themes = os.listdir(path_corpus)
    for theme in themes:
        counter = 0
        # skip hidden files such as .DS_Store
        if not theme.startswith('.'):
            theme_directory = os.path.join(path_corpus, theme)
            for file in os.listdir(theme_directory):
                if counter < n_files:
                    path_file = os.path.join(theme_directory, file)
                    # read_file returns the text content of a file (see the sketch below)
                    text = read_file(path_file)
                    dict_corpus["label"].append(theme)
                    dict_corpus["text"].append(text)
                counter += 1
    return dict_corpus
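read_file is another small helper from the source code. Here is a minimal sketch, followed by a usage example (the corpus folder is the one created by scrape_articles above):

def read_file(path):
    # hypothetical helper: return the whole text content of a file
    with open(path) as f:
        return f.read()

# at most 1000 texts per theme, as a {"label": [...], "text": [...]} dictionary
dict_corpus = cr_corpus_dict('corpus', n_files=1000)
print(len(dict_corpus["text"]), "documents loaded")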

Resources

Here you are. Happy scraping! And mind the potential legal issues!

You can find the source code here.