Generating calendar events from emails


I am not the most innately organised person, so I am a fan of using a (digital) calendar to keep track of my personal life: when I am traveling, when I am meeting up with friends or going to the cinema, etc. It helps ensure I don't miss appointments or double-book myself.


But if you have a lot of bookings--for example, if you are planning a vacation and booking multiple flights, trains and hotels--it can be tedious to manually create calendar events for each booking with the correct times, location and other details. The problem is compounded when you are dealing with different timezones. Some service providers give you the option to add bookings to your calendar, but often they don't, and even when they do, it doesn't always work properly (for example, British Airways give you the option to add flights to a calendar, but in my experience they don't get the timezones correct).


For a long time, I was looking for an open source solution that would accomplish this, and considered implementing something myself. I started by writing a Python script that would parse a Ryanair flight confirmation email. It was surprisingly easy, as Ryanair's HTML confirmation emails contain tags with information conforming to schema.org's FlightReservation schema (and a number of other related schemas). However, some other airlines' confirmation emails aren't so easily parsed.


KItinerary


Eventually I found out that a number of KDE developers have been working on a solution to this problem, in the form of software called KDE Itinerary. KDE Itinerary is a "digital travel assistant" that aims to handle many aspects of travel, including navigation and boarding pass management. One of its key features is extracting data about itinerary items from various formats, including emails and PDFs. The data extraction engine is maintained separately as a C++ library called KItinerary, which can be integrated into third-party software or used from the command line via an executable called `kitinerary-extractor`, which is available as a flatpak. Support for over 250 different service providers (airlines, train companies, booking websites, etc.) is included in KItinerary, and it is relatively easy to add new parsing functionality with a bit of JavaScript.


https://invent.kde.org/pim/kitinerary


KItinerary is a powerful piece of software, though I don't love its KDE dependencies and I would prefer something that can be run natively on non-KDE systems. Installing kitinerary-extractor via flatpak is easy but takes up a lot of disk space if you don't already have KDE dependencies installed (apparently about 1.6 GB including all dependencies on my system). That is a lot, but disk space is cheap these days and if you have it to spare then kitinerary-extractor gives you an easy way to leverage the excellent parsing and post-processing work done by the KItinerary devs.


kitinerary-extractor reads in an email file and, by default, outputs parsed data as JSON-LD. The data structure conforms to the schema.org ontology. If you provide the `-o iCal` argument, instead of JSON-LD, it outputs parsed data as an iCalendar file.


https://json-ld.org/

https://schema.org/docs/schemas.html

https://en.wikipedia.org/wiki/ICalendar


Building a workflow


KItinerary apparently integrates with KMail as well as Nextcloud Mail, so if you use either of those applications for email, you can use it quite easily. Personally I don't, so I set out to roll my own solution. Ideally, what I want is something that periodically checks my inbox for new emails and, if it finds emails that contain information about a booking or event, adds the booking or event to my calendar. Over the course of a long weekend I hacked something together that uses mbsync to fetch emails, KItinerary to parse them into iCalendar files and Python to do some post-processing and deliver the iCalendar files via email.


Fetching email with mbsync


mbsync is a tried and trusted tool for syncing two mailboxes, and can be used to download emails from a remote IMAP mailbox to local storage. Confusingly, the project is called isync but the executable itself is called mbsync. The isync package is available in the repos of most major Linux distributions.


Below is a rough outline of a configuration file (usually stored at ~/.mbsyncrc) that can be used to fetch new emails from an IMAP mailbox.


# Define a local mailbox, in the Maildir format
MaildirStore mailbox-local
Path ~/Mail/
Inbox ~/Mail/

# Define a remote IMAP mailbox
IMAPStore mailbox-remote
Host <insert IMAP server address>
User <insert email address>
PassCmd <insert command to access password>

# Fetch new messages for processing
Channel mailbox-fetch
Master :mailbox-remote:
Slave :mailbox-local:
Sync Pull New ReNew

For more information on how mbsync configuration works, consult its man page. A couple of points to note:


The `PassCmd` value should be a shell command that can be used to get the password to your IMAP server. A common choice is to use `pass`, a popular command line password manager. Instead of PassCmd, you could include your password directly in the config file using the `Pass` directive, but of course this has security implications. Alternatively, you can provide neither directive, and mbsync will prompt you for a password at runtime.

The final line, `Sync Pull New ReNew`, tells mbsync to pull (download) new messages since it was last run, including messages that were found on a previous run but not downloaded for some reason. mbsync is capable of two-way sync, but here we just want to pull new messages from the IMAP server and not send anything the other way.


You then simply run mbsync like so (the argument corresponds to the name of the channel we defined in the config file):


mbsync mailbox-fetch

This will download new messages from your IMAP server and store them under ~/Mail. Three new subdirectories will be created: `cur/`, `new/` and `tmp/`. `new/` stores messages marked as unread and `cur/` stores messages marked as read.
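To check what mbsync has delivered, Python's standard `mailbox` module can report which subdirectory each message sits in. The following is a minimal sketch rather than part of the workflow itself; the helper name `count_by_subdir` is made up for illustration.

```python
import mailbox
from collections import Counter


def count_by_subdir(maildir_path: str) -> Counter:
    """Count the messages in each Maildir subdirectory ("new" or "cur")."""
    # create=False: raise an error rather than silently creating an empty Maildir
    mb = mailbox.Maildir(maildir_path, create=False)
    counts = Counter()
    for msg in mb:
        # Each MaildirMessage remembers the subdirectory it was read from
        counts[msg.get_subdir()] += 1
    return counts
```

Calling `count_by_subdir("~/Mail")` after a fetch should give a quick picture of how many messages landed in `new/` and `cur/`.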


mbsync will keep track of what it has downloaded, so it will not download the same email multiple times, even if the local version is deleted.
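If you want to trigger the fetch from a script rather than by hand, one option is to call mbsync through Python's `subprocess` module. This is a rough sketch under a few assumptions: `mbsync` is on your PATH, the function names `mbsync_command` and `fetch_mail` are hypothetical, and the optional `config` parameter relies on mbsync's `-c` option for pointing at a configuration file other than the default.

```python
import subprocess
from typing import List, Optional


def mbsync_command(channel: str, config: Optional[str] = None) -> List[str]:
    """Build the argument list for syncing `channel` with mbsync."""
    cmd = ["mbsync"]
    if config is not None:
        # -c points mbsync at a config file other than ~/.mbsyncrc
        cmd += ["-c", config]
    cmd.append(channel)
    return cmd


def fetch_mail(channel: str, config: Optional[str] = None) -> bool:
    """Run mbsync and report whether it exited successfully."""
    result = subprocess.run(mbsync_command(channel, config), capture_output=True)
    return result.returncode == 0
```

With the configuration shown earlier, `fetch_mail("mailbox-fetch")` would run the same command as above. Note that `subprocess.run` raises `FileNotFoundError` if mbsync is not installed.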


https://isync.sourceforge.io/

https://en.wikipedia.org/wiki/Maildir

https://www.passwordstore.org/


Extracting data with KItinerary


Now that you have downloaded your emails, you can feed them to kitinerary-extractor to extract the details of the events (if any) they describe.


If you have downloaded kitinerary-extractor via flatpak, then the correct command to run it is:


flatpak run org.kde.kitinerary-extractor <arguments>

That looks a bit ugly so let's wrap it in a simple shell script which we will call `kitinerary-extractor`:


#!/bin/sh

exec flatpak run org.kde.kitinerary-extractor "$@"

Then calling it is simply a matter of:


kitinerary-extractor <arguments>

kitinerary-extractor takes an optional `--output` argument, which can be "JsonLd" (the default) or "iCal". Specifying iCal output will cause kitinerary-extractor to print out an iCalendar (.ics) file with information about the event described in the email:


kitinerary-extractor -o iCal my_email_file.eml

The output is rather lengthy so I won't reproduce it here, but I suggest you experiment on some emails of your own.


Post-processing


If you call the above command on an email that doesn't contain any information that kitinerary-extractor knows how to extract, it will output an empty iCalendar file (i.e., one with a root VCALENDAR object but without any VEVENT objects). There are some types of email that kitinerary-extractor will extract *some* information from, but which do not correspond to calendar events. For example, it seems to do this on emails from eBay or LinkedIn. In these cases, it will (rather unhelpfully) output an iCalendar file which contains an event (VEVENT) object, but no start or end time.


Therefore, if you are calling kitinerary-extractor on every email you receive, you will need to check the resulting iCalendar file to ensure that it contains at least one VEVENT object that has start time (DTSTART) and end time (DTEND) properties.


We can do this using Python and the popular `icalendar` library:


from icalendar import Calendar

def has_real_event(cal: Calendar) -> bool:
    for evt in cal.walk("VEVENT"):
        if ("DTSTART" in evt) and ("DTEND" in evt):
            return True
    return False

If you are sending the event by email, the recipient email address should be listed as an attendee. Otherwise, when (for example) I click to accept the invitation in Thunderbird, I get a dialog telling me I'm not on the guest list. It still lets me add it to my calendar, but it's annoying.


from icalendar import Calendar, vCalAddress, vText

def add_attendee(cal: Calendar, email_addr: str) -> Calendar:
    """Add `email_addr` as an attendee to each event in `cal`. Modifies `cal`
    in-place.
    """
    for evt in cal.walk("VEVENT"):
        a = vCalAddress(f"MAILTO:{email_addr}")
        a.params["ROLE"] = vText("REQ-PARTICIPANT")
        evt.add("attendee", a, encode=0)
    return cal

kitinerary-extractor outputs one iCalendar file per email that it parses. If you are processing multiple emails, it may be more convenient to get one iCalendar file with multiple events rather than multiple files. You can merge a number of VCALENDAR objects into a single VCALENDAR, like so:


from typing import Collection

def merge_calendars(cals: Collection[Calendar]) -> Calendar:
    """Merge a number of calendars into one, which has the timezone and event
    info from all calendars.
    """
    # Keep track of the timezones we've already added
    added_tzids = set()

    new_cal = Calendar()

    for c in cals:

        # Add timezone definitions to new calendar (avoiding duplication)
        for tz in c.walk("VTIMEZONE"):
            tzid = tz["TZID"]
            if tzid not in added_tzids:
                new_cal.add_component(tz)
                added_tzids.add(tzid)

        # Add events to new calendar
        for evt in c.walk("VEVENT"):
            new_cal.add_component(evt)
    return new_cal

To tie this all together, we use Python's `mailbox` module (part of the standard library) to iterate through the new emails we fetched with mbsync, process them one by one and merge the resulting calendars into a single calendar:


import subprocess
from datetime import datetime
from mailbox import Maildir
from email.message import Message
from typing import Collection, Optional

from icalendar import Calendar, Event, vCalAddress, vText

CMD = ["/usr/bin/flatpak", "run", "org.kde.kitinerary-extractor", "-o", "iCal"]

def process_email(email: Message) -> Optional[Calendar]:
    """Process `email`, returning a calendar if it contains a relevant event
    and None otherwise.
    """
    output = subprocess.run(CMD, input=email.as_bytes(), capture_output=True)
    if output.returncode:
        # kitinerary-extractor returned an error
        return None
    cal = Calendar.from_ical(output.stdout.decode())
    if has_real_event(cal):
        return cal
    return None

def process_mailbox(
        mb: Maildir,
        email_addr: Optional[str] = None
    ) -> Optional[tuple[Calendar, list[str]]]:
    """Process each email in `mb`, returning a calendar containing all
    parsed events (or None if no events were found). Also return a list of
    details of emails that had events. `email_addr` will be added as an
    attendee to each event.
    """

    cals = []
    emails = []
    for msg in mb:
        c = process_email(msg)
        if c is not None:
            cals.append(c)
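            # Fall back to an empty string if a header is missing, so that
            # join() is never handed None
            emails.append(" ".join((
                str(msg.get("Date", "")),
                str(msg.get("From", "")),
                str(msg.get("Subject", ""))
            )))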
    if cals:
        c = merge_calendars(cals)
        if email_addr:
            add_attendee(c, email_addr)
        return c, emails

The above function takes, as its first argument, a `mailbox.Maildir` object representing a Maildir directory. In the example mbsync configuration we looked at above, the Maildir directory is `~/Mail/`. You can initialise the object to pass to `process_mailbox` like so:


from mailbox import Maildir

mb = Maildir("~/Mail")

As well as returning a Calendar object containing the relevant events, the `process_mailbox` function returns a list of strings containing some basic information about the emails that were found to contain information about events.


Neither the above function nor mbsync will automatically remove fetched emails once you are finished with them, so you should do this manually to avoid repeatedly parsing the same emails every time.
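One way to handle that clean-up is to delete each message from the Maildir as soon as it has been handled. The sketch below uses only the standard `mailbox` module; the function name `drain_mailbox` and the idea of passing in a `handler` callback are illustrative assumptions, not something the functions above already do.

```python
import mailbox
from typing import Callable


def drain_mailbox(
        mb: mailbox.Maildir,
        handler: Callable[[mailbox.MaildirMessage], None],
    ) -> int:
    """Pass each message to `handler`, then delete it from the Maildir."""
    processed = 0
    # Copy the keys first so that removing messages does not disturb iteration
    for key in list(mb.keys()):
        handler(mb[key])
        mb.discard(key)  # like remove(), but silent if the file is already gone
        processed += 1
    return processed
```

Because the message is only deleted after `handler` returns, an exception in the handler leaves that message (and any later ones) in place for the next run. Since mbsync remembers what it has already downloaded, deleting the local copies does not cause them to be fetched again.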


https://icalendar.readthedocs.io

https://docs.python.org/3/library/mailbox.html


Delivery


Now that you have an iCalendar file, you need a way to actually get it into your calendar. If your calendar supports the CalDAV protocol, you may be able to do this directly using a CalDAV client. Here, we will just use Python to send the calendar as an email attachment via SMTP.


import smtplib
from email.message import EmailMessage
from typing import Iterable

from icalendar import Calendar

def email_calendar(
        to: str,
        subject: str,
        event_details: Iterable[str],
        cal: Calendar,
        sender: str,
        passwd: str,
        smtp_server: str,
        smtp_port: int = 587,
    ):
    """Send an email with attachment."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = to
    msg.set_content('\n'.join(event_details))
    msg.add_attachment(
        cal.to_ical(),
        maintype="text",
        subtype="calendar",
        filename="events.ics"
    )
    with smtplib.SMTP(smtp_server, smtp_port) as server:
        server.starttls()
        server.login(sender, passwd)
        server.send_message(msg)

As well as a Calendar object, this function takes (as its `event_details` argument) the list of strings returned by `process_mailbox` describing the relevant emails. It sends those details as the body of the email, whereas the calendar is sent as an attachment.


Other than that, you need to provide the recipient email address, sender email address, sender password, SMTP server and port number (which defaults to 587, the standard port for mail submission, where the connection is upgraded to TLS with STARTTLS).
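Separating message construction from sending makes it possible to inspect the email before it goes anywhere. The sketch below uses only the standard library and builds the same kind of message as `email_calendar`, but returns it instead of sending it. The function name `build_calendar_message`, and the choice to accept raw bytes (such as the result of `cal.to_ical()`), are illustrative assumptions.

```python
from email.message import EmailMessage
from typing import Iterable


def build_calendar_message(
        to: str,
        sender: str,
        subject: str,
        event_details: Iterable[str],
        ics_data: bytes,
    ) -> EmailMessage:
    """Build (but do not send) an email carrying `ics_data` as an attachment."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = to
    # The body lists the emails that the events were extracted from
    msg.set_content("\n".join(event_details))
    # Attach the calendar as text/calendar so mail clients recognise it
    msg.add_attachment(
        ics_data,
        maintype="text",
        subtype="calendar",
        filename="events.ics",
    )
    return msg
```

You can then print the result of `build_calendar_message(...)` to see the full MIME structure, or hand it to `smtplib.SMTP.send_message` once you are happy with it.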


What's next?


The above examples should allow you to quickly set up a fairly rudimentary process for periodically fetching new emails, extracting information about events and adding the events to your calendar. It could certainly be improved on in several ways, to make it easier to use and more featureful. Below are a couple of examples of improvements that could be made, though these are left as an exercise to the reader (and the author).


Create more extractors


KItinerary relies on extractor scripts to extract detailed information from emails and other documents. An impressive number of extractor scripts are bundled with the KItinerary library, but it is also possible to write your own. You can then tell kitinerary-extractor to use one or more additional extractors using the `--extractors` argument, or direct it to a directory of extractors using the `--additional-search-path` argument.


Each extractor consists of a JavaScript file to extract the data from the relevant document (for example, by operating on the DOM of a HTML email), coupled with a JSON file which contains a filter that KItinerary uses to determine which extractor to use on which file. Filters use pattern-matching against various fields in the document--for example, a filter might match against an email where the "From" header includes "example.com".


There are detailed instructions on how to write an extractor in the README of the KItinerary project, linked above. There is also an application called KItinerary Workbench, also available via flatpak, which is very helpful for writing and debugging custom extractors.


If you do write and test an extractor you think others would find helpful, you should consider contributing it to the KItinerary library.


https://github.com/KDE/kitinerary/tree/master/src/lib/scripts

https://invent.kde.org/pim/kitinerary-workbench


More fine-tuned post-processing


In the example above we told kitinerary-extractor to output the extracted information as an iCalendar file, and then did some light post-processing in Python. This is quite convenient if your ultimate goal is to include the event in your calendar, and kitinerary-extractor is pretty good at outputting a useful iCalendar file. However, the JSON-LD format that kitinerary-extractor uses by default will often contain more machine-readable information about the event, which you could use for further post-processing. Using Python or your preferred scripting language to parse the JSON-LD output generated by kitinerary-extractor would allow you to create an iCalendar file with the exact information, and in the exact format, that you desire.
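As a rough illustration of that idea, the sketch below reads a JSON-LD document with Python's standard `json` module and pulls out the fields needed for a calendar entry. The sample data is invented: it uses schema.org's FlightReservation and Flight property names, and it assumes times arrive as ISO 8601 strings, which may not match what kitinerary-extractor actually emits for a given email. The function name `summarise_flights` is likewise hypothetical.

```python
import json
from datetime import datetime
from typing import Any, Dict, List

# Invented sample data using schema.org property names (not real extractor output)
SAMPLE = """
[{
  "@context": "http://schema.org",
  "@type": "FlightReservation",
  "reservationNumber": "ABC123",
  "reservationFor": {
    "@type": "Flight",
    "flightNumber": "1234",
    "airline": {"@type": "Airline", "iataCode": "XX"},
    "departureAirport": {"@type": "Airport", "iataCode": "DUB"},
    "arrivalAirport": {"@type": "Airport", "iataCode": "LIS"},
    "departureTime": "2023-08-05T10:30:00+01:00",
    "arrivalTime": "2023-08-05T13:20:00+01:00"
  }
}]
"""


def summarise_flights(jsonld_text: str) -> List[Dict[str, Any]]:
    """Return a summary, start and end time for each flight reservation."""
    data = json.loads(jsonld_text)
    if isinstance(data, dict):
        data = [data]  # tolerate a single object as well as a list
    flights = []
    for item in data:
        if item.get("@type") != "FlightReservation":
            continue
        flight = item.get("reservationFor", {})
        start, end = flight.get("departureTime"), flight.get("arrivalTime")
        if not (isinstance(start, str) and isinstance(end, str)):
            continue  # skip entries without usable start and end times
        code = flight.get("airline", {}).get("iataCode", "") + flight.get("flightNumber", "")
        origin = flight.get("departureAirport", {}).get("iataCode", "?")
        dest = flight.get("arrivalAirport", {}).get("iataCode", "?")
        flights.append({
            "summary": f"{code} {origin}-{dest}",
            "start": datetime.fromisoformat(start),
            "end": datetime.fromisoformat(end),
        })
    return flights
```

Each dictionary returned by `summarise_flights(SAMPLE)` holds a summary string and two timezone-aware datetimes, which you could then turn into a VEVENT with the `icalendar` library in whatever shape you prefer.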




Generating calendar events from emails was published on 2023-08-05
