The journey from requests timing out to responses in under 200ms
Stratumn is an aging piece of software, built many years ago. It has been serving multiple clients in production around the world for more than 7 years now. At the scale the software operates, many pieces have been deteriorating and are starting to break, including one of the most important ones: the workflow overview. The workflow overview is a giant data table showing all the different processes you have been, or need to be, working on. It is the hub of the software.

This part of the software had been declining for months, to the point where some requests were cut off by the Application Load Balancer (because they exceeded the maximum time limit). Even more disturbing, database usage was through the roof.
The root cause of the problem was poor use of Postgres: bad RLS policies and awful generated SQL queries, resulting in complete sequential scans of the database.
This is the kind of spike we had every day, for no real reason.

Let’s dive into all the bottlenecks we faced in the Stratumn architecture and the changes that helped us make this software blazingly fast 🚀!
At Stratumn we use GraphQL for our frontend-backend communication layer. It’s a great piece of technology. It helps decouple the frontend teams from the backend team, and most of the time makes feature development faster.
GraphQL is not tech debt: for our company, our use case, our team (5 people), it’s great. To power GraphQL we use Postgraphile, another great piece of technology; it has powered the Stratumn engine for a long time now, and done its job very well.
We have never encountered a bug caused by the technology itself. Every time, it was our fault.
Postgraphile automagically creates a GraphQL schema out of Postgres. This reduces GraphQL boilerplate, and with it the overall development time. It gets the job done, pretty well.
But everything comes at a cost: Postgraphile builds the GraphQL schema out of the Postgres one, and this comes with limitations, especially on the authorization side. Postgraphile imposes the authorization layer: to enforce authorization on a table, you need to use Row-Level Security (RLS).
This is a bit strange at first, because RLS policies are not that commonly used, and so they are often misunderstood.
Recently Stratumn signed a new client whose scale is 1000 times bigger than our second largest one (our production database went from 1 unit to 1000 units, where the unit is arbitrary).
This broke the application in many places, but in one in particular: the permission and authorization layer. Our old RLS design did not scale well because we made a lot of mistakes when designing it.
Switching the entire authorization layer of in-production software is not easy. We did it in less than 3 months, with complete end-to-end coverage of the application. Let’s explore how.
RLS is a Postgres feature made to enforce security while performing actions on a table. It ensures that any query made to the database will not leak data (if the RLS policy is well written, of course): even with some weird injection, you would not be able to see the data, because it will never be shown to you.
Postgres itself filters it out. Bad day for hackers.
RLS also creates a safety net: you can let any developer, even the most junior, write database queries, and be sure they will not leak data, even in very complex join queries, where a WHERE clause can very quickly be forgotten. What a treat!
An RLS policy works by attaching a SQL expression that must return a boolean. If it returns true, the row is readable; if not, the row does not exist. Vanished.
I’m a “learn by example” guy, so I will try to illustrate how everything works. Let’s say you have a database with a document table inside. Each user can see the documents they own (makes sense).
In “normal Postgres”, if you wanted to query a user’s documents, you would have some kind of endpoint querying the database with a query like:
```sql
SELECT * FROM document WHERE owner_id = current_setting('app.user_id')::INTEGER;
-- or
SELECT * FROM document WHERE owner_id = 42;
```

If you don't know about current_setting in Postgres, I highly encourage you to explore it: it is very useful.
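For completeness, here is a minimal sketch of how app.user_id could be set per request; the setting name comes from the queries above, everything else (the value, the transaction shape) is illustrative:

```sql
BEGIN;
-- The third argument 'true' makes the setting local to this transaction,
-- so concurrent requests on other connections are not affected.
SELECT set_config('app.user_id', '42', true);

-- Any query in the same transaction can now read it back:
SELECT current_setting('app.user_id')::INTEGER;
COMMIT;
```

A backend typically runs set_config at the start of every transaction, right after authenticating the request.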
If you wanted to do the same query in the RLS paradigm, you would first set up the policy like so:
```sql
CREATE POLICY user_access_documents ON document
  FOR SELECT
  TO public
  USING (owner_id = current_setting('app.user_id')::INTEGER);
```

and then run the same query as before, with the WHERE clause removed:
```sql
SELECT * FROM document;
```

This looks good and sounds very scalable. And for this example, you would be right. In terms of performance I would also argue that you lose nothing, because Postgres is so smart: I’m 99.99% sure it translates the RLS into some kind of WHERE-like clause during the planning phase. So it would make no difference for the end user in this example.
Provided you set up the RLS properly. A setup we got wrong. Obviously.
A policy expression in Postgres can be more than a simple comparison: it can call Postgres functions. At Stratumn, every RLS policy calls a Postgres function (for clarity). Having a function instead of an inline expression is fine, and should not impact performance by itself.
But you have to be very, very careful with this function, and treat it with all the respect it deserves, because otherwise it can destroy your Postgres performance and send your query times to the moon.
Indeed, if the Postgres query planner cannot optimize your RLS, it has to go row by row through your table, looking for a match. This is called a sequential scan, and you want to avoid it at all costs.
It is like finding a word in the dictionary by looking at every word, one by one, until you find the right one. Not efficient, you will agree.
This is exactly what we had in production for many years at Stratumn. But because the scale was low, the bottleneck never really showed; Postgres is that fast.
Before optimization, querying one of our tables could take seconds, only to return 0 results.
Yes. This is because Postgres has to check every row of the table, without using any kind of index. I’m sure Postgres uses parallel reads and was able to read many rows at once, but this time all the Postgres engineers’ work was for nothing. Our code was too bad. We won (at what cost??).
This was the explain analyze result before the optimization:

It was so slow because, as you can see, Postgres had no choice but to run the loop for each row of the table, and at each row execute the workflow_can_select(workflow.*) function. And as you can guess, the workflow_can_select function was not optimized at all.
This RLS design has one fatal flaw: the policy function needs the row it wants to authorize as an argument.
That is the worst thing you can do in RLS. It forces Postgres to execute the function for every row of the table, creating this seq scan nightmare.
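To make the flaw concrete, here is a sketch of what such a per-row policy function can look like; the real workflow_can_select body is not shown here, so the workflow_member table and the logic below are hypothetical:

```sql
-- Hypothetical per-row check: it receives the row it must authorize,
-- so Postgres has no choice but to call it once per row.
CREATE FUNCTION workflow_can_select(w workflow) RETURNS BOOLEAN AS $$
  SELECT EXISTS (
    SELECT 1
    FROM workflow_member m
    WHERE m.workflow_id = w.id  -- depends on the current row
      AND m.user_id = current_setting('app.user_id')::INTEGER
  );
$$ LANGUAGE sql STABLE;
```

Because the row is an argument, the planner cannot hoist the call out of the scan: every row of workflow triggers its own lookup.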

As you can guess, there is a way to solve this problem, and for us the gains were massive: more than a 250x improvement (which is crazy, to be honest). We had to rewrite the RLS.
The trick was to reverse the RLS. (To be honest, this is not a trick; it is written everywhere in the Postgres RLS literature.)
The right solution is to fetch all the rows the user can read in one query, and use the result of that query in the RLS. Moving from “can I read row 23432?” to “is row 23432 in the set of authorized rows?”.

Instead of

```sql
CREATE POLICY workflow_can_select_policy ON workflow
  FOR SELECT
  TO public
  USING (workflow_can_select(workflow.*));
```

we used:
```sql
CREATE POLICY workflow_can_select_policy ON workflow
  FOR SELECT
  TO public
  USING (id IN (SELECT get_all_workflow_select_ids()));
```

This changes everything. In the second policy, Postgres runs the expensive get_all_workflow_select_ids function only once, then caches the result and checks each row against it, turning the expensive RLS into a simple membership filter. Nice.
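For illustration, the set-returning function could be sketched like this; the name comes from the policy above, but the body (and the workflow_member table) is hypothetical:

```sql
-- Hypothetical body: compute, once per query, every workflow id
-- the current user is allowed to read.
CREATE FUNCTION get_all_workflow_select_ids() RETURNS SETOF INTEGER AS $$
  SELECT m.workflow_id
  FROM workflow_member m
  WHERE m.user_id = current_setting('app.user_id')::INTEGER;
$$ LANGUAGE sql STABLE;
```

The function no longer takes a row as an argument, so Postgres can evaluate it once and reuse the result for every row.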
This gave us this explain analyze result:

As you can see, Postgres builds two plans (the first one checks whether you are a superadmin; it is not relevant here). The second one plans the function call once, then uses its result to filter the workflows.
This reduced the query time from 65ms to 0.3ms. BLAZINGLY FAST. The improvement is even bigger on production data, where the database is massive, because seq scan time grows linearly with database size.
This is a highly impactful change to our application. And even though I was pretty sure I had not changed the permission logic or messed anything up, I did not trust myself.

And as every sane person would do in this situation, I wrote a program to check all the permissions, for every user, for every object, on all the environments Stratumn operates (this is one big advantage of not having a massive data scale).
The idea is pretty simple: for every user, and for every important query that user might run, fetch what they can see on the old system, then check that they see the same results on the new system. The program only breaks if it finds a difference between the old and the new results.
This process took days to run, at night, on production data. But it was the only way to really verify that nothing was messed up.
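The core of the comparison can be sketched in SQL, assuming the old policy results are still reachable side by side (the workflow_old_view name is illustrative):

```sql
-- Impersonate one user, then list every row visible under exactly
-- one of the two systems; any row returned is a permission difference.
SELECT set_config('app.user_id', '42', false);

(SELECT id FROM workflow_old_view   -- visible under the old RLS
 EXCEPT
 SELECT id FROM workflow)           -- visible under the new RLS
UNION ALL
(SELECT id FROM workflow
 EXCEPT
 SELECT id FROM workflow_old_view);
```

The program then loops this over every user and every environment.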
After weeks of testing, the change was applied in production and our query times dropped.
Slow RLS was only the tip of the iceberg. A giant of slowness was waiting for me in the dark: filtering and querying the JSON data.
Stratumn is a business process manager: we developed an engine capable of running any kind of business workflow. The engine is totally flexible; any kind of workflow can be created. The only limit is your imagination 🌈!
This is great for the product team and for the CSMs, but a nightmare for the engineering team. We have to provide an engine capable of filtering, ordering and grouping any kind of JSON, of course without any kind of schema, and pretty fast please. Great.
We use Postgraphile for the GraphQL query interface. It can create very sophisticated Postgres queries from any GraphQL query. But over the years our Postgres schema became too convoluted, too complex.
Because of that, and all the custom Postgres functions we had created everywhere, Postgraphile was no longer able to generate reasonable queries. Our simplest queries became monsters of complexity.
Most generated SQL queries were more than 300 lines long, using nested selects without any joins. A nightmare for the Postgres query planner.
This is what a simple explain analyze gave us for a basic workflow overview request:

As you can see, this plan is a mess: function scans everywhere. This is 100% our fault.
Obviously, Postgres could not run this plan fast; the simplest queries took seconds to run. And for our new client, most of the queries would time out because all the different seq scans took too long…
At this point we had 2 options to optimize this mess :
- try to dig into the schema code, find ways to remove functions, and make better use of Postgraphile features
- rewrite the problematic resolver from scratch
We spent some time analyzing Postgraphile and how we could implement changes in our code base. We could have developed a lot of custom plugins, changed the schema…
This might have worked, but it would have meant delegating the core query logic to Postgraphile and binding ourselves even more to this library.
Stratumn’s long-term vision is to decouple critical hot paths from generated code as much as possible. Keeping Postgraphile here would have meant breaking this rule.
We decided to keep Postgraphile for most of the simple queries, but to switch the core query to code we can monitor, analyze and fine-tune.
In this design, Postgraphile forwards all the “querying of the JSON” to a custom-made resolver. We kept most of Postgraphile’s features, and it still resolves most of the other pieces of data in our software, just not this one.
The workflow overview’s trace querying is delegated to custom code.
Creating a custom query engine is not really that hard: you only need to map your custom DSL filter logic to WHERE, ORDER BY and GROUP BY clauses. The only tricky part in Stratumn’s case was handling the JSONB: filtering is done only on data contained in the data JSONB column, using only JSONB operators.
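As an example of that mapping, a DSL filter might compile down to JSONB operators like these (the paths and values are illustrative):

```sql
-- Containment: the data column contains this sub-document.
SELECT * FROM trace.trace_state
WHERE data @> '{"global": {"assignedStatus": "todo"}}';

-- Path extraction as text, usable in WHERE, ORDER BY or GROUP BY.
SELECT * FROM trace.trace_state
ORDER BY data #>> '{global,assignedStatus}';
```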
This helped me really understand the fragile link between GIN indexes and gin_trgm_ops.
The query engine is written with indexes in mind. Its primary goal is to generate SQL queries that hit as many indexes as possible.
Indexes are created by a database admin (we could also create them automatically based on the columns of the big data table we filter on), with help from the product team, who explain what kind of filtering users will perform. This works best because it helps the database admin choose the right operator for each index. Every index is made by hand because it is mapped to a specific JSONB path in the JSON data.
```sql
CREATE INDEX custom_index_42_assignedStatus ON trace.trace_state
  USING GIN ((data #>> '{global,assignedStatus}') gin_trgm_ops);
```

This index can only be hit on workflows where the JSON data defines the global.assignedStatus path (by design of the workflow configuration).
When the user performs a query, the engine generates SQL queries like this one:

```sql
SELECT *
FROM trace.trace_state
WHERE data #>> '{global,assignedStatus}' ILIKE '%todo%';  -- illustrative filter value
```

This query is much simpler than the old one, and it also hits many indexes at the planning phase:

This led to another very significant improvement (from more than 2 minutes to less than 300ms of query time).
First, Postgres is an amazing piece of software, and when you get bad performance out of your queries, 99.99% of the time (even more, to be honest) it is your fault.
RLS is a very good Postgres feature for enforcing security, but it can lead to severely degraded performance when implemented the wrong way.
Indexes are the most important tool in Postgres; you should over-optimize to hit them. JSONB is far from the best way to store data, but when you have no choice, you can index it too. Be careful with the type of operator you use in Postgres: some of them are not indexable.
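As a last illustration of that operator caveat, a sketch under the same (illustrative) schema:

```sql
-- @> is indexable with a plain GIN index on the column:
CREATE INDEX idx_state_data ON trace.trace_state USING GIN (data);
SELECT * FROM trace.trace_state
WHERE data @> '{"global": {"assignedStatus": "todo"}}';  -- can hit idx_state_data

-- A #>> comparison is only indexable through an expression index:
CREATE INDEX idx_state_status ON trace.trace_state
  ((data #>> '{global,assignedStatus}'));
SELECT * FROM trace.trace_state
WHERE data #>> '{global,assignedStatus}' = 'todo';       -- can hit idx_state_status
```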