How to speed up database inserts with many duplicates
or “how to not be stupid with your inserts”
I'm working on a small side project where I'm consuming public listing data from an API for fun and profit. The API lets me GET the newest items matching a specific query and paginates at a maximum page size of 50. I store every listing I've already seen in Postgres for later analysis, but the API gives me no way to fetch only new listings, so I have to decide myself whether or not to keep paginating.
Since I'm quite comfortable and productive with Python, I've implemented this as an APScheduler job that runs at a given interval, say, every 10 minutes.
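The scheduling part is essentially just this; poll_listings is a placeholder name for the job itself (sketched a bit further down), and error handling and database setup are left out:

import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def poll_listings():
    ...  # pull new pages from the API and persist them (sketched below)


async def main():
    scheduler = AsyncIOScheduler()
    # run the polling job every 10 minutes
    scheduler.add_job(poll_listings, 'interval', minutes=10)
    scheduler.start()
    # keep the event loop alive so the scheduler has something to run on
    while True:
        await asyncio.sleep(3600)


asyncio.run(main())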
Upon startup, the job pulls the first page of new listings, persists them and counts how many duplicates it encountered. I decided that it was good enough™ to stop fetching new pages whenever a full page of duplicate listings was encountered. I acknowledge that this is not the most robust design, especially when there are many new pages to be pulled and errors occur somewhere in the middle of the pagination. It might also miss a small number of listings around edge cases, but I've found that to be negligible so far.
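In code, the stopping rule boils down to something like this. fetch_page and persist_page are placeholder names for the API call and the database step described in the rest of this post, and connection is assumed to be an asyncpg connection created at startup:

async def poll_listings():
    page = 1
    while True:
        # fetch_page is a stand-in for the paginated API call
        listings = await fetch_page(page)
        if not listings:
            break
        # persist_page stores the listings and returns how many of them were
        # duplicates; connection is an asyncpg connection created at startup
        duplicate_count = await persist_page(connection, listings)
        if duplicate_count == len(listings):
            # a full page of already-seen listings: we've caught up, stop here
            break
        page += 1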
Sequential inserts are bad
So far so good. I wanted to get off the ground quickly, so instead of thinking of a proper way to check for duplicates, I would just insert listings one by one and catch the specific error that is thrown when a duplicate is encountered:
for listing in listings:
    try:
        await connection.execute(
            'INSERT INTO listings (a, b, c) VALUES ($1, $2, $3)', listing.a, listing.b, listing.c)
    except asyncpg.exceptions.UniqueViolationError:
        # the listing already exists, remember it and move on
        duplicates.append(listing)
        continue
I mean, it's not pretty, but it works, right? Well, it did for the first prototype, but once the project grew a bit, I realized that I was wasting resources on the service, on the network and on the database. The service tries to persist a listing that is already present, which is bound to fail, but to get there the data still needs to be serialized, put on the wire, deserialized on the database and checked against constraints, and finally a response with the appropriate error code has to travel back to the service, which then handles an error case that isn't really an error at all.
Logs looked something like this:
2020-02-17 09:17:18 INFO API GET took 6820.19 ms. Insert took 40122.35 ms and inserted 41 listings.
2020-02-17 09:22:28 INFO API GET took 6130.07 ms. Insert took 42001.81 ms and inserted 0 listings.
2020-02-17 09:22:28 INFO Found full page of duplicates. Exiting!
Finding a smarter way to insert
Since I'm running this on cheap VPSs, I don't want to waste resources, which are scarce from the get-go. The logs suggest that an insert takes around one full second per listing, whether or not that listing eventually gets inserted. Inserting 50 listings takes the same amount of time as inserting none, which is obviously and utterly 💩. So after thinking about a better solution, I came to the realization that I can do a quick and relatively cheap membership check using the IDs of the listings I want to persist, since the ID is the primary key and therefore indexed by default.
listing_ids = tuple(map(lambda l: l.id, listings))
records = await connection.fetch(
    'SELECT id FROM listings WHERE id = any($1::int8[])',
    listing_ids)
# fetch returns Record objects, so unpack the ids into a set for fast lookups
duplicate_ids = set(map(lambda r: r['id'], records))
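Passing all the IDs as a single array parameter with = any($1::int8[]) is, as far as I can tell, the idiomatic way to do an IN-style lookup with asyncpg, since it won't expand a Python sequence into an IN (...) list for you.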
This now allows me to write the insert code without catching UniqueViolationError and to use connection.executemany:
non_duplicate_listings = list(
    filter(lambda l: l.id not in duplicate_ids, listings))
insert_tuples = list(
    map(lambda l: l.to_insertable_tuple(), non_duplicate_listings))
await connection.executemany(
    'INSERT INTO listings (a, b, c) VALUES ($1, $2, $3)', insert_tuples)
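Pieced together, the persistence step from the pagination sketch earlier now looks roughly like this. persist_page is still just an illustrative name, and the transaction around executemany is my own precaution rather than something the numbers depend on:

async def persist_page(connection, listings):
    # cheap membership check against the primary key index
    listing_ids = tuple(map(lambda l: l.id, listings))
    records = await connection.fetch(
        'SELECT id FROM listings WHERE id = any($1::int8[])',
        listing_ids)
    duplicate_ids = set(map(lambda r: r['id'], records))

    # insert only the listings we haven't seen yet, all in one go
    non_duplicate_listings = list(
        filter(lambda l: l.id not in duplicate_ids, listings))
    insert_tuples = list(
        map(lambda l: l.to_insertable_tuple(), non_duplicate_listings))
    async with connection.transaction():
        await connection.executemany(
            'INSERT INTO listings (a, b, c) VALUES ($1, $2, $3)', insert_tuples)

    # note: check-then-insert assumes this job is the only writer; concurrent
    # writers could still race into a UniqueViolationError here
    return len(listings) - len(non_duplicate_listings)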
Besides the fact that this feels much cleaner (and, as a consequence, I no longer feel like I need a shower after working on this service), the logs prove that it was a good idea.
Results
Let's look at some cold hard data from the new version of the code:
2020-02-18 06:31:13 INFO API GET took 7022.19 ms. Insert took 6280.75 ms and inserted 39 listings.
2020-02-18 06:31:04 INFO API GET took 6446.77 ms. Insert took 235.86 ms and inserted 0 listings.
2020-02-18 06:31:04 INFO Found first page full of duplicates, exiting.
Inserting listings is now much faster, as many inserts happen in one go instead of in a separate round-trip insert statement for each listing. Average insert time per listing went down from around 1000ms to around 160ms, which is an improvement of 6x. Dealing with pages full of duplicate listings became significantly faster as well, because the job can stop after determining that all IDs are already present in the database. That case improved from 40-ish seconds to just 235ms, or by a factor of 170. Yes, freaking 170x.
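Back-of-the-envelope, using the log lines above:

# per-listing insert time before and after the change
print(40122.35 / 41)                       # ≈ 978 ms per listing with sequential inserts
print(6280.75 / 39)                        # ≈ 161 ms per listing with the batched insert
print((40122.35 / 41) / (6280.75 / 39))    # ≈ 6.1x faster

# the all-duplicates page, depending on which of the old runs you compare against
print(40122.35 / 235.86)                   # ≈ 170x
print(42001.81 / 235.86)                   # ≈ 178x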
All in all, I am satisfied with my refactoring. Who could have thunk that doing things in a smarter way would lead to better results?