How to speed up database inserts with many duplicates
or “how to not be stupid with your inserts”
I'm working on a small side project where I'm consuming public listing data from an API for fun and profit. The API lets me GET the newest items matching a specific query and paginates at a maximum page size of 50. I store every listing I've already seen in Postgres for later analysis, but the API gives me no way to fetch only new listings, so I have to decide myself whether or not to keep paginating.
Since I'm quite comfortable and productive with Python, I've implemented this as an APScheduler job that runs at a given interval, say, every 10 minutes.
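The scheduling part is essentially just this; poll_listings is a placeholder name for the job itself (sketched a bit further down), and error handling and database setup are left out:

import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def poll_listings():
    ...  # pull new pages from the API and persist them (sketched below)


async def main():
    scheduler = AsyncIOScheduler()
    # run the polling job every 10 minutes
    scheduler.add_job(poll_listings, 'interval', minutes=10)
    scheduler.start()
    # keep the event loop alive so the scheduler has something to run on
    while True:
        await asyncio.sleep(3600)


asyncio.run(main())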
Upon startup, the job pulls the first page of new listings, persists them and counts how many duplicates it encountered. I decided that it was good enough™ to stop fetching new pages whenever a full page of duplicate listings was encountered. I acknowledge that this is not the most robust design, especially when there are many new pages to be pulled and errors occur somewhere in the middle of the pagination. It might also miss a small number of listings around edge cases, but I've found that to be negligible so far.
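In code, the stopping rule boils down to something like this. fetch_page and persist_page are placeholder names for the API call and the database step described in the rest of this post, and connection is assumed to be an asyncpg connection created at startup:

async def poll_listings():
    page = 1
    while True:
        # fetch_page is a stand-in for the paginated API call
        listings = await fetch_page(page)
        if not listings:
            break
        # persist_page stores the listings and returns how many of them were
        # duplicates; connection is an asyncpg connection created at startup
        duplicate_count = await persist_page(connection, listings)
        if duplicate_count == len(listings):
            # a full page of already-seen listings: we've caught up, stop here
            break
        page += 1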
Sequential inserts are bad
So far so good. I wanted to get off the ground quickly, so instead of thinking of a proper way to check for duplicates, I would just insert listings one by one and catch the specific error that is thrown when a duplicate is encountered:
for listing in listings:
    try:
        await connection.execute(
            'INSERT INTO listings (a, b, c) VALUES ($1, $2, $3)', listing.a, listing.b, listing.c)
    except asyncpg.exceptions.UniqueViolationError:
        # the listing already exists, remember it and move on
        duplicates.append(listing)
        continue
I mean, it's not pretty, but it works, right? Well, it did for the first prototype, but once the project grew a bit, I realized that I was wasting resources on the service, on the network and on the database. The service tries to persist a listing that is already present, which is bound to fail, but to get there the data still needs to be serialized, put on the wire, deserialized on the database and checked against constraints, and finally a response with the appropriate error code has to travel back to the service, which then handles an error case that isn't really an error at all.
Logs looked something like this:
2020-02-17 09:17:18 INFO API GET took 6820.19 ms. Insert took 40122.35 ms and inserted 41 listings.
2020-02-17 09:22:28 INFO API GET took 6130.07 ms. Insert took 42001.81 ms and inserted 0 listings.
2020-02-17 09:22:28 INFO Found full page of duplicates. Exiting!
Finding a smarter way to insert
Since I'm running this on cheap VPSs, I don't want to waste resources, which are scarce from the get-go. The logs suggest that an insert takes around one full second per listing, whether or not that listing eventually gets inserted. Inserting 50 listings takes the same amount of time as inserting none, which is obviously and utterly 💩. So after thinking about a better solution, I came to the realization that I can do a quick and relatively cheap membership check using the IDs of the listings I want to persist, since the ID is the primary key and therefore indexed by default.
listing_ids = tuple(map(lambda l: l.id, listings))
records = await connection.fetch(
    'SELECT id FROM listings WHERE id = any($1::int8[])',
    listing_ids)
# fetch returns Record objects, so unpack the ids into a set for fast lookups
duplicate_ids = set(map(lambda r: r['id'], records))
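Passing all the IDs as a single array parameter with = any($1::int8[]) is, as far as I can tell, the idiomatic way to do an IN-style lookup with asyncpg, since it won't expand a Python sequence into an IN (...) list for you.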
This now allows me to write the insert code without catching UniqueViolationError and to use connection.executemany:
non_duplicate_listings = list(
    filter(lambda l: l.id not in duplicate_ids, listings))
insert_tuples = list(
    map(lambda l: l.to_insertable_tuple(), non_duplicate_listings))
await connection.executemany(
    'INSERT INTO listings (a, b, c) VALUES ($1, $2, $3)', insert_tuples)
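Pieced together, the persistence step from the pagination sketch earlier now looks roughly like this. persist_page is still just an illustrative name, and the transaction around executemany is my own precaution rather than something the numbers depend on:

async def persist_page(connection, listings):
    # cheap membership check against the primary key index
    listing_ids = tuple(map(lambda l: l.id, listings))
    records = await connection.fetch(
        'SELECT id FROM listings WHERE id = any($1::int8[])',
        listing_ids)
    duplicate_ids = set(map(lambda r: r['id'], records))

    # insert only the listings we haven't seen yet, all in one go
    non_duplicate_listings = list(
        filter(lambda l: l.id not in duplicate_ids, listings))
    insert_tuples = list(
        map(lambda l: l.to_insertable_tuple(), non_duplicate_listings))
    async with connection.transaction():
        await connection.executemany(
            'INSERT INTO listings (a, b, c) VALUES ($1, $2, $3)', insert_tuples)

    # note: check-then-insert assumes this job is the only writer; concurrent
    # writers could still race into a UniqueViolationError here
    return len(listings) - len(non_duplicate_listings)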
Besides the fact that this feels much cleaner (and, as a consequence, I no longer feel like I need a shower after working on this service), the logs prove that it was a good idea.
Results
Let's look at some cold hard data from the new version of the code:
2020-02-18 06:31:13 INFO API GET took 7022.19 ms. Insert took 6280.75 ms and inserted 39 listings.
2020-02-18 06:31:04 INFO API GET took 6446.77 ms. Insert took 235.86 ms and inserted 0 listings.
2020-02-18 06:31:04 INFO Found first page full of duplicates, exiting.
Inserting listings is now much faster, as many inserts happen in one go instead of in a separate round-trip insert statement for each listing. Average insert time per listing went down from around 1000ms to around 160ms, which is an improvement of 6x. Dealing with pages full of duplicate listings became significantly faster as well, because the job can stop after determining that all IDs are already present in the database. That case improved from 40-ish seconds to just 235ms, or by a factor of 170. Yes, freaking 170x.
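Back-of-the-envelope, using the log lines above:

# per-listing insert time before and after the change
print(40122.35 / 41)                       # ≈ 978 ms per listing with sequential inserts
print(6280.75 / 39)                        # ≈ 161 ms per listing with the batched insert
print((40122.35 / 41) / (6280.75 / 39))    # ≈ 6.1x faster

# the all-duplicates page, depending on which of the old runs you compare against
print(40122.35 / 235.86)                   # ≈ 170x
print(42001.81 / 235.86)                   # ≈ 178x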
All in all, I am satisfied with my refactoring. Who could have thunk that doing things in a smarter way would lead to better results?