Research

The Build: Migrating Data Direct to Google BigQuery

6 July 2021

The Build: Migrating Data Direct to Google BigQuery

Share

The Build is dv01's technical blog, where we share posts about our experience building, maintaining, and scaling our products and data infrastructure.


Over the past two years, dv01 has experienced accelerating growth. When the pandemic hit, we saw usage increase as users flocked to understand their risk. As usage and our volume of data grows, we knew modernizing our data warehouse was critical in order to meet user demand and help us scale in the future.

In this post, we discuss our experience migrating from Microsoft SQL Server to Google BigQuery in the context of our data-as-a-service product, Data Direct, and why we are so excited for this change.

What is Data Direct?

As the name implies, Data Direct gives users direct access to the loan-level data dv01 receives from loan data providers. Users can connect to the data via a dv01 hosted database or SFTP server. Those that work with large amounts of data know firsthand how time consuming it can be to collect, clean, validate, and normalize data. Data Direct lets our users reclaim that time, allowing them to focus on working with data to be more productive.

Data Direct

Click to download the Data Direct information sheet

More Data, More Problems

Currently, dv01 processes over 700 securitizations and 90M+ loans with more being added every month. While users were excited to see new datasets on the platform, their query times became longer, especially with complex queries.

Additionally, as the workload on our self-managed SQL Server grew, it was requiring more maintenance from the team, which took away from other projects.

As our data grew both in size and complexity, we realized we needed a database solution that met the following criteria:

  • Scalability

  • Cost-effectiveness

  • Low management overhead

  • Ability to handle peta-bytes of data

After much research, we decided Google BigQuery is the best solution for our needs. Here’s why:

  • Scalability:  go from 0 cores to 3000+ cores and then back down to 0 in less than 1 minute

    [1]

  • Cost-effectiveness: Choice of on-demand or fixed-price pricing

    [2]

  • Low management: Configuration and management of servers & data storage is handled by Google

  • Ability to handle peta-bytes of data: Petabytes of data can be analyzed in minutes and gigabytes in seconds

Google BigQuery is a server-less, scalable data warehouse that enables super fast SQL queries by harnessing the processing power of Google's infrastructure. Because the database is managed by Google, we are able to ensure high availability while spending less time managing a SQL Server instance. BigQuery is designed to quickly scale the amount of processing power dedicated to a query when needed, giving it the ability to process petabytes of data without a hiccup.

While BigQuery meets all of our needs, there is no single best solution for every use case and there are always tradeoffs to be made. One drawback users will notice when moving from SQL Server to BigQuery is that there is a minimum spin-up time associated with each query, no matter how big or small. Overall, we felt that our users would benefit from the switch to BigQuery, as many of our users were running queries that took several minutes on SQL Server.

Client Transition

The entire migration process for Data Direct—from planning to execution—took 6 months. While we knew our users would greatly benefit from the transition, we needed to make sure it wouldn’t drastically disrupt their workflow. As such, we reassured our users that the data infrastructure would not change—schemas, column data types, and table names remained as is.

However, there were a few things that had to change and would require users’ behaviors to change:

1. New SQL syntax required: Because SQL Server uses a variation of ANSI SQL known as T-SQL, it was not possible to copy and paste all valid SQL Server commands to BigQuery, which uses standard SQL. Square brackets, TOP and other syntactic differences required changes to client queries.

SQL-Diagram

2. New connection libraries setups: For clients connecting via R, Python, or any other language, new connection libraries would have to be utilized to properly access BigQuery.

3. New IDE: For interactive queries, the easiest way to connect to BigQuery is directly from the web browser. While we found this to be an improvement upon having to install and configure a local IDE, users accustomed to working in SQL Server Management Studio would need to change their habits. We chose superQuery as our preferred web IDE, as it is designed specifically to make working with BigQuery more productive with features, including

  • Customizable tabs and boards

  • Autocomplete

  • Easy access to query history

  • Charting tools

Making the Move

Since we knew our users may not be familiar with BigQuery or superQuery, we wanted to make sure they felt confident in their ability to adapt to the new workflow and continue to trust us as their data provider. To accomplish this, we did the following:

  • Announced the transition

     

    before

     

    fully migrating. We gave our users a two-month notice, which gave them ample time to connect, ask questions and translate SQL queries.

  • Created detailed documentation as a quick reference. This was especially helpful for users who run automated queries. Using our sample code they could swap out a few lines and be up and running in minutes.

  • Held client meetings to demo the new functionality as well as follow up meetings to help translate SQL and update connection code.  


A Bright Future

One month after the transition to BigQuery, dv01 has had zero downtime and our clients are querying more data than ever. It is not uncommon for a query that took minutes on SQL Server to take only seconds on BigQuery. By transitioning before slow query performance was widespread, we were able to minimize the impact on our clients day to day workflows. We expect Google BigQuery to handle any data we onboard, which will be especially important as Data Direct is expanding to support multiple product lines, including our Tape Cracker product, and dv01 expands into new markets. The positive feedback we received has been encouraging, and while the process of migrating was a rewarding challenge, we are looking forward to scaling with BigQuery.

- Authored by Brock Donovan, Senior Data Engineer


  1. https://cloud.google.com/blog/products/gcp/understanding-bigquerys-rapid-scaling-and-simple-pricing

  2. https://cloud.google.com/blog/products/data-analytics/choosing-bigquery-pricing

newsletter-image@3x

Receive the Latest Insights

Subscribe to our mailing list to stay up-to-date with the latest market insights and product updates.

Subscribe to Newsletter

RELATED POSTS