Table of Contents
Background
Planning & Preparation
Doing the work
Reflections & Wins
As Traba went through hypergrowth, we started to experience the limitations of our Firestore NoSQL database as our data quickly became more relational and connected. We unpack the learnings from our quest to incrementally deprecate our Firestore usage and move our data to Postgres to lay the foundation for the next phase of company building.
Background
Traba’s story starts in the summer of 2021 with a janky Google sheet powering our “backend” and a meeting with the mayor of Miami to talk about our new app—promising thousands of well-paying jobs to Floridians—scheduled for 3 weeks from the day. Our CTO, Akshay, and founding engineer, Moreno, had been heads down putting the puzzle pieces together for a relational schema that captured the system at the time, but with the meeting quickly looming, they scrapped the schema and got set up on Firebase in a day.
…and so the story goes for thousands of startups before us, and likely thousands after. Firebase has been enabling young companies to go fast™ for over a decade and has gained a reputation in the industry for being the go-to solution for rapid prototyping.
It’s easy to see why when you look at what they offer:
An easily modifiable data model
Horizontal scaling by default
A tightly integrated ecosystem complete with cloud functions, cron jobs, and a local emulator
A generous free tier
Great performance for single reads and writes
The hard parts get abstracted away so you can focus on the make-or-break part of building a startup—finding product-market fit.
In this post we’ll dive specifically into Traba’s usage of Firestore, Firebase’s NoSQL cloud database.
In the fire
Firestore got us far. However, our system had gotten increasingly complex and our data was far more relational than it was at the start. As anyone who has used Firestore can attest, querying for relational data is not fun. After a year and a half of building, some detrimental ways in which Firestore was hindering engineering velocity and feature development started to bubble to the surface.
But before we get into those, some quick context! Traba is a tech staffing platform that helps our business customers (think warehousing, manufacturing, and distribution) find flexible workers to handle fluctuations in demand. Workers see shifts posted by businesses on the Traba mobile app and can sign up for them. In Firestore, we had companies
, workers
, shifts
and shift_signups
collections to represent these concepts.
Fuzzy searching
The workers
collection contained worker documents with fields like firstName
and lastName
, among others. Our operations team regularly needs to do one-off searches for workers with all kinds of filters. For example, they might filter for workers using a substring on either their first or last name.
Let’s say we were searching for a worker but the only detail we remembered about them was that the substring “ave” was somewhere in their full name. With Firestore, we couldn’t do fuzzy matching; we would need to pull all worker documents into memory and filter them out locally, returning all documents that had the substring “ave” somewhere in the firstName
or lastName
fields. This became borderline impossible once the documents started numbering in the hundreds of thousands – we regularly ran into memory bottlenecks and performance issues.
We just want the "Dave Smith" and "Sarah Graves" documents here, but unfortunately we need to also pull the "Blake Jones" document and search through all documents in the backend code
Relational searching
The above example just involves a single collection/table though; let’s look at something more involved.
Something our operations team would regularly need to do is look for workers that:
Worked on a shift last week (i.e. from date A to date B)
Are not currently busy (i.e. do not have an active shift they are working on)
Are not blocked by a specific company (companies can block workers they no longer want to work with)
In addition to the collections mentioned above, we also had an associations
collection that captured the relationship between companies and workers (i.e. whether a worker was favorited or blocked by a given company).
In order to actually perform such a complex query, we would need to make multiple calls to Firestore:
I hope you didn’t read all of that, but if you did, I’m so sorry. This type of code was spreading all over the system as we added more complex features and it wasn’t fun to write nor review.
Firestore’s limitations
With the above two examples, there are workarounds. For the first example, you could connect Firestore to a full-text search solution like Algolia or ElasticSearch. For the second, you could figure out what popular queries are being run and begin to store extra data to make those queries more easily runnable in the future. Both cases work but we didn’t feel they were ideal.
The two examples also are NoSQL issues; they’re not specific to Firestore. But Firestore did have a few design choices that made other types of queries more challenging:
There was no way to count documents; we had to fetch all documents to get a count (or store the count on a separate doc and constantly update it, which was error-prone)
There was no way to do partial matching (SQL has the
LIKE
operator)There was no way to easily query large arrays (unfortunately, the
in
operator only let you search up to 30 entries)There was no way to query with multiple inequalities (this was probably the biggest hurdle to need to work around)
In other words, Firestore didn’t allow any query that would require scanning across multiple indexes
While Firestore was great for scalability, more and more some of the design choices made to achieve that were becoming glaringly obvious and difficult to work around.
Exploring alternatives
Unfortunately for us, these examples were just a few of the ~20 or so critical endpoints and features we determined were getting hamstrung by Firestore usage. Conversations began sprouting up all over about what a switch to another system would look like.
Contenders
A few different technologies were thrown around: we talked about MongoDB, MySQL and Postgres, among others.
The issue with MongoDB was that although we might not get affected by some of the Firestore-specific limitations, we would still need to be hitting increasingly relational nails with a non-relational hammer. We quickly ruled it out.
We also spent some time talking about MySQL vs Postgres. Both seemed good; some heavy-hitters like Slack and HubSpot embraced MySQL early on and didn’t look back. However, Postgres is currently very popular so we found that there was a lot of good modern documentation and advice on adopting it. Furthermore, we read a few war stories about companies needing to adopt “MySQL’isms” in their code to support confusing default MySQL behavior.
Intent on making our migration as easy, fast and safe as possible, we ended up opting for Postgres.
Firestore limitations, revisited in Postgres/SQL
Knowing we wanted to switch to Postgres, what would the previous limitations explored look like in a relational world?
Fuzzy searching
With a dedicated workers
table and the power of Postgres regular expressions, the fuzzy search query would get reduced to:
Here Postgres will only return to us the 2 rows we are asking for; we don't need to pull everything into backend code!
Relational searching
Compared to the multiple Firestore queries for getting the workers list, a similar query in SQL might look like:
It’s no doubt still complex, but using the power of joins, we can accomplish it in a single query. We just need one network call and database connection to get and return all the relevant workers. Furthermore, we can write the query in a way where we only get the data we need. Remember that old quote from Joe Armstrong?
"You wanted a banana but what you got was a gorilla holding the banana and the entire jungle"
Yeah... that’s what Firestore queries felt like.
Postgres’s query magic
All of the limitations of Firestore mentioned above? They would be non-existent in Postgres. Postgres is optimized for complex queries and it would be actively difficult to push its limits at our current size.
Planning & Preparing
Breaking it down
First, we had to get an idea of what success looked like. Our north stars were:
Increasing development speed and ergonomics: We didn’t want the changes we adopted to feel worse for our engineers. The Firestore API was incredibly easy to use; were there relational-friendly tools that had as simple an API?
Increasing data integrity: With the switch to a relational schema, we wanted to be able to move away from hundreds of conditional checks in the code to see if a field even existed on a document or not.
Improving query speed: Data integrity is important, but we also wanted everything to just be faster.
Improving external integration options: We hoped we could set up our new tools in a way that made not just engineers’ lives better, but also those of our operations team. We (at the time) used a lot of Retool workflows and still regularly use BigQuery, so one of the goals was to improve these external integrations to make sure our partners were unblocked.
Avoid compromising on the existing system: We were aiming to do this migration with zero downtime. Furthermore, we wanted our efforts to be iterative; if we could get engineers partially building in the new system before the full migration of the old system was complete, that would help to get us all familiar with the new way of doing things well before the full migration was complete
Next steps included figuring out which tools we were going to use, and after that, deciding on an implementation plan.
Tooling
Hosting provider
We spent a week researching providers to host Postgres on. We were debating between Google Cloud Platform’s Cloud SQL offering and Finland-based Aiven. Cloud SQL is a managed database service supporting MySQL, Postgres and SQL Server; Aiven is more of a multi-cloud data platform that manages both SQL or NoSQL databases for you (hosted on the big cloud providers like AWS, GCP, Azure, etc). We were already invested pretty deep in the GCP ecosystem and Cloud SQL would let us fine-tune database settings in a much more granular way. However, when speaking with Aiven, they made a solid first impression with their fast response times and support from a dedicated solutions architect. Furthermore, we were debating using Kafka as our Postgres → BigQuery ingestion channel, but the Aiven architect we worked with showed us a unique way of connecting GCP’s Datastream through Aiven such that we could avoid the estimated ~$500/mo Kafka bills.
Aiven offered a bunch of other goodies too:
A (basically) fully managed solution with already optimized Postgres instances
Great backup and replication products
Out-of-the-box pool management
A quick turnaround time for supporting new Postgres features and updates (with the ability to control when these happen)
Multi-cloud solutions to avoid vendor lock
We went with Aiven. Ultimately, the decision came down to flexibility and control vs time saved. We opted for the solution that would allow us to spend less time fiddling with complicated internal settings and PgBouncer configurations and focus on building great products.
ORM
A much quicker decision was if we should use an ORM or not, and if so, which one. We quickly decided yes for the former; the extra sugar around managing migrations and having type safety were too good to pass up. For the latter, we were debating between TypeORM and Prisma.
We liked Prisma because of the solid documentation and ability to access raw SQL if needed. Furthermore, the Prisma community was large and active; we had no trouble getting responses on Slack. Prisma had releases every month and the support it had made us feel good about fully adopting it as a team.
However, in our research we did find some anecdotes of teams who had used Prisma and ultimately rejected it due to a lack of stability compared to TypeORM, and other teams who were impacted by Prisma deciding to support and then subsequently drop support for Go. The tool wasn’t as mature as other ORMs at the time, and although it’s got a slick Rust query engine under the hood, it was still not the fastest ORM when we were looking. Ultimately, we still decided to go with Prisma as the developer ergonomics felt solid and we didn’t think there was an active risk of TypeScript support getting dropped.
The implementation plan
With the tooling figured out, we decided to split the migration into two parts:
Replicate any new data being written to Firestore into Postgres
Backfill existing data from Firestore into Postgres
We were hellbent on performing a zero-downtime migration, so we came up with two different approaches for copying new data that allowed for that.
Approach #1: Trigger-based replication
Firestore offers a product called “triggers” that lets you handle events in Firestore without needing to do so from client code. It was a remarkably simple approach: the idea was to configure triggers to run whenever a document changed in Firestore. The trigger would kick off a cloud function that acted as a translation layer between a Firestore document and the corresponding row in a Postgres table, and would proceed to insert, update or delete that row in Postgres directly.
Pros
Deferred the execution of the write to Postgres to the trigger, so no extra latency would be added to the original request from our Node server
Separation of concerns was respected; engineers didn’t need to worry about remembering to write code to perform the Postgres write
It was easier to handle exceptions as the trigger could be set to retry without affecting the Node server
Less development time would be needed upfront to understand the current usage in the Node server
Cloud functions horizontally scale with ease
Cons
There would be extra lift in the latter half of the migration, because we would first need to write the replication logic in the trigger function, and then come back and duplicate the same logic in the Node server for when we started fully migrating collections over from Firebase → Postgres
Adds an extra layer of stress on the DB due to the number of connections potentially spiraling and our pool needing to work harder during peak periods. For example, if we did control writes to Postgres in our backend code and needed to write to 4 separate but interrelated tables at the same time, we could simply open up a single connection and perform all 4 writes in a transaction. However, with the trigger-based approach, 4 separate connections would be needed since each collection would have its own trigger and replication logic
As a result, this would lead to increased time and research needed to make sure the pool manager was implemented in a scalable way that would handle spikes and periods of increased load
The replication lag prevented us from being able to use Postgres for reads for latency sensitive features that required real-time data
Approach #2: Transaction-based replication (shadow writes)
The second approach was to perform a 2nd write to Postgres wherever we were writing to Firestore in our backend code.
Pros
This would lead to a faster migration as the code would already exist in our backend Node server. More specifically, it would be easier to kill collections and update writes to only flow through to Postgres instead of both Firestore and Postgres
Much simpler database connection management since we could control it at the code-level
Cons
Would require us to identify every existing call-site where database calls to Firestore were happening (this also needed to happen with the trigger-based approach but just at a later stage of the process)
Would require engineers to need to know for all future call-sites added that a second write was happening to Postgres and that they would need to add that code
Harder to handle retry logic
Needing to write to two databases could add latency
Tradeoffs, tradeoffs, tradeoffs
Assuming we could figure out the connection management piece, we liked the idea of trigger-based replication, but there was still one big problem with it: race conditions.
Consider the following example: Let’s say we have a shift with three worker openings available on it.
The first worker joins, triggering two replications for:
The newly created shift signup
The edit to the base shift document (incrementing its
workerOpeningsFilled
field by 1)
Both replications complete successfully, and at this point, the shift document resembles:
For the next two openings, two workers join close to the exact same time, again triggering two replications each. The replications from the shift_signups
collection to the shift_signups
table complete successfully, but something weird happens with the base shift document replications. We check Postgres and see this:
Wait what?! Shouldn’t workerOpeningsFilled
be equal to 3
?
Due to a race condition, the first replication finishes after the second! This puts our data into a bad state, with workerOpeningsFilled
being ultimately replicated with a value of 2
instead of the intended 3
.
There are a lot of variables such as network latency and computation time that affect cloud function timing; as a result, they are not guaranteed to execute in order. Perhaps the machine performing the first replication needed a bit longer of a cold start than the machine performing the second replication. As a result, the first function was delayed and completed after the second function. Now our data is in a bad state.
Making trigger-based replication work
We were still bullish on trigger-based replication as we believed it would let us go faster at the onset compared to doing shadow writes, but there were two types of race conditions we needed to solve for.
Race condition #1 – Incorrect row data fixed via reconciliation scripts
This is the race condition described above, where certain row data would be out of sync. To get around this issue, we wrote a reconciliation script per collection that would validate the data integrity between Firestore and Postgres and upsert into Postgres as needed. Initially, we would run these manually whenever out-of-sync data was detected, but for our most critical collections we started running these on a cron as a self-healing mechanism of sorts.
When we initially started the migration, most of these scripts would only take a few minutes to run. Towards the end, some of these were taking upwards of 20+ minutes to complete (and that was running from a VM placed very close to both databases; if we ran locally, it easily would have taken 1+ hours!).
Race condition #2 – Data integrity issues (i.e. constraints)
In the NoSQL world, you don’t have to worry about relations between documents at all. But in the SQL world, we had plenty of tables that referenced fields on other tables, and thus the order of insertions mattered. For example, whenever a worker signed up for the platform, we created a worker document for them, a profile document containing some metadata about their account, and a metrics document, among other smaller documents. In Postgres land, due to the relational nature of these, the worker row needs to exist before the profile row does. If a race condition caused the profile trigger to finish first, the replication would fail.
Triggers offered a nice—albeit very brute force—way of solving this: retries. We wrote custom retrying logic for our triggers that would basically keep on trying to insert that profile row until it succeeded, for X number of times (we set X to a high number; often 15+). If even after all those retries an insertion was still failing, we setup a Slack integration that would ping us so we could investigate manually.
Throughout the entire one-year-long migration, that Slackbot amazingly pinged us a grand total of 0 times. The retry logic just worked.
Backfilling existing data
With replication setup, we wrote some robust backfill scripts using our translation logic for each trigger to backfill all the existing Firestore data into Postgres.
Splitting up the work
Of course, we wouldn’t migrate all of our collections at once. The team sat down and figured out the best order in which to kill collections and move the backend logic to support the new Postgres tables and use Prisma. It was a fine balance between aggressively tackling our most critical tables like shifts
, and making sure we weren’t procrastinating on the heavier lifts. We started with some smaller collections to gain confidence in the flow we had put together, and then planned to migrate medium-sized collections in the middle of the process and the large ones like shifts
, shift_signups
and workers
towards the end.
Doing the work
We performed the migration while doing our regular product work; we didn’t have a team dedicating 100% of their time to this project, so everyone was just chipping in when they had bandwidth. The full timeline roughly looked like:
January: We started with benchmarking what peak replication loads would look like. We did end up needing to tweak Aiven throughout the process to ensure no connections in the pool got dropped, so it was best we figured out early on that a large number of concurrently executing cloud functions could overload our database. After that, we began writing all the trigger-based replication cloud functions and other essential tooling.
February – March: We completed the first full migration of a collection to Postgres, our worker_metrics
table. This was a massive gain for us. In Firestore, we initially had to do a whole bunch of data wrangling to generate these worker metrics; now, we had it in a single query.
April: We re-analyzed our data and found that we were successfully replicating data into Postgres with 99.99% precision. We also built our first large feature completely in Postgres: worker search. In the months leading up to the migration, the number of workers signing up and using our app had been exploding. As alluded to in our example above, we had to write more and more Firestore queries to support all the filters needed for a useful search and the endpoint continually worsened in performance. Needless to say, our operations team had been asking for an optimized worker search endpoint for a while. Because the feature didn’t need to be 100% strongly consistent and could tolerate a couple hundred milliseconds of replication latency, we were able to get away with using Postgres for reads here which dramatically brought latencies down on the endpoint.
May – October: We were full steam ahead with developing new features using Postgres and migrating the bulk of our Firestore collections successfully over to Postgres. We were able to knock out some critical collections like users
and companies
.
October – December: Eventually it came time to do two of the scariest migrations: shifts
and shift_signups
. These two greatly impacted all business operations; shifts
touched everything and shift_signups
touched worker payments. We really needed to get this right.
The trinity test
We decided to start with shifts
instead of shift_signups
for a few reasons:
shifts
was a tiny bit smaller and safer as it didn’t touch worker payments directlyThe learnings gained from doing
shifts
would help withshift_signups
considerablyWe wanted to test what switching both reads and writes to Postgres at the same time would look like (elaborated upon below)
Infra team member Mike Staunton started dutifully plugging away at the code changes necessary for this migration, identifying hundreds of code call-sites where we had been making collection calls to shifts
in Firestore and translating them to efficiently use Postgres. There were a lot of gotchas that came up around managing transactions, dealing with side effects and limiting the number of database connections spawned, but after a few weeks, Mike and the infra team figured out where the issues were and resolved them.
Along the way, the team realized they would need to switch both reads and writes to Postgres at exactly the same time. For other collections, after migrating the code, we had just been flipping on reads, and only after letting that bake for some time, writes. We couldn’t do that for this collection due to the potential for race conditions putting us in dangerous states. For example, let’s say we had reads set to Postgres, but writes set to Firestore. Due to the delay on the trigger-based replication, this would mean writes were always ~1 second behind. If we had a shift with 9 out of 10 worker openings and a worker joined it, worker openings would get set to 10
after 1 second. If in between that time another worker tried to join, the read would still return 9
since the replication hadn’t happened yet. By the end of it, we’d have a shift with 11 workers signed up, although the shift only asked for 10.
Because this collection was so critical, the team decided to take extra precautionary measures. We wrote a rollback script that could recover data to Firestore from Postgres (while also temporarily freezing all requests so data wouldn’t get out of sync during the recovery) should the need arise. We also used Artillery to blast our dev environment, which was already using Postgres, to ensure that it could handle the increased load. The last step was for Javi, our analytics Yoda, to do one extra data validation on top of the reconciliation scripts for us to be sure the data was aligned.
Finally, it was time to flip the feature flags and turn on Postgres reads/writes for shifts
. On a cold NYC Thursday morning, the infra team started asking around our operations team to get the least risky time to do the migration. A time was found and all the relevant parties were alerted. Trinity by Ludwig Göransson from the Oppenheimer soundtrack was played, and right at the moment the explosion occurred, Mike flipped both flags.
Real image of Mike flipping the feature flags, bringing us into the promised land.
Despite the drama, we thankfully did not ignite the atmosphere, and the rollback script remained unused. It wasn’t completely magic though; some data (particularly worker openings fields) did get out of sync and we did run into some performance concerns which artillery
didn’t catch, so there were definitely some late nights ahead to fix all those issues. Regardless, we largely considered it a resounding success, and were able to take our learnings from the things that didn’t go optimally and apply them to the shift_signups
migration in the following weeks. Just like that, the worst, most stressful parts of the migration were behind us.
Unknown unknowns
Despite all the research, at the end of the day, we knew we’d run into unknown unknowns while actually doing the migration.
The first hurdle was needing to write code in a way that minimized database connections. In Firestore land, we never worried about managing database connections. As a result, we had a lot of code where we had JavaScript .map()
calls that were getting Firestore documents. When we translated those directly to Prisma/Postgres code, there would be N connections used, where N was equal to the size of the array we were mapping over. We started seeing large spikes in connection usage on Postgres, and quickly realizing what was causing it, needed to switch from a single-object code model to batching to bring down the connections used.
The second happened around 8 months into the migration. We were heads down on rewriting things with Postgres/Prisma but were getting increasingly crippled by the state of our backend system. There were a bunch of dependency cycles, Firestore was being called from many different locations (there was no coherent data access layer) and there was a lot of code duplication.
We decided to invest a month of time to a dedicated refactor where we:
Removed all of the dependency cycles
Culled instances of code duplication
Moved all code that accessed Firestore to dedicated “collection” files (which in turn made collection migrations so much easier since we could just focus on translating each function in a collection file over to its Postgres/Prisma variant)
Removed old Firestore triggers
...and a whole bunch of other stuff that took time in the short term, but ultimately saved us time in the long run.
Reflections & Wins
Exactly 1 year from when we started, we killed the last collection, workers
, and migrated it to Postgres. Throughout the year, we successfully migrated our entire system with high precision and zero downtime, all while not hampering our product development velocity.
An all around better experience
This project was a huge win for us:
We massively improved developer experience and ergonomics. Engineers here at Traba prefer how much less wasteful using a relational system for highly relational data feels than a NoSQL system, and we haven’t heard any sentiment expressed on missing the Firestore APIs
Everything is faster. Seriously, our dashboards load faster, our operations team is happier, and critical read-heavy endpoints are down from 10+ seconds to <1 second
Firestore does pricing based on number of reads, writes and deletes. Naturally, as we scaled our business, costs started rapidly increasing. The migration allowed us to slash 70+% of our Firestore bill, and because our new Postgres instance was billed hourly instead of on usage, we set ourselves up for success down the road
Learnings
Database migrations are long and arduous and stressful. There were a few things that stood out to us through:
Identifying race conditions from the onset is incredibly difficult
It’s easy to overlook the management of database connections
When performing big migrations, you have to account for how the company and the product will change during the course of the migration, otherwise you may need to revise your migration plan midway through
One big example where this almost hamstrung us was with those reconciliation scripts. As mentioned, towards the end of the migration some of them would take 20+ minutes to run, not to mention we needed to keep updating them as we were building more product and adding more fields that needed to get synced over. Although we were extremely fortunate to grow at the pace we did during the migration and have the kinds of scale problems we were having, had we waited around any longer it’s entirely possible those reconciliation scripts would’ve been infeasible to continually run
Even though we were only around ~12 engineers by the end of the migration, it can be hard to enforce things like remembering to add new fields to a replication trigger when it’s not the immediate primary database being used and referenced in the code. We established alignment across our engineers early on to make sure we didn't leave any stones unturned
We (still) 💜 Firestore
...and we still use it regularly for a bunch of other things, including our lightweight form of feature flags and wherever it still made sense to use NoSQL.
We also continue to heavily use other parts of Firebase such as cron jobs for pre-computing data and metrics, and scheduled tasks/jobs.
Our system architecture in a nutshell, using both Postgres and Firebase in conjunction
Thank you team!
We couldn’t have shipped this migration without these phenomenal infra team members: Allison Hojsak, Arvind Anand, Mike Staunton, Moreno Antunes, Nazer Hasanian, and Tara Nagar. And a huge thanks to Javier Rodriguez who managed our entire analytics stack during the whole migration, keeping all of our pipelines and ETLs green and the operations team in sync with the changes our engineering team was making.