Coauthored by Chris “Cool” Hand “Luke” ft. Mathew “Big Money” Sella
There’s an increasingly popular saying in software development that “memory is cheap,” and optimizations are often deferred or ignored because the available tooling will pick up the slack. Basic software development can get away with some bad habits, but if your system may ever be challenged by volume or scale, you’ll likely wish you had put more thought into it. There’s a strong sense of satisfaction in building a system that can handle a tremendous amount of load. It’s similar to an engineer building a structure that can handle ten times the weight it was originally designed to hold, simply because it was well built. Here we’ll dive into some considerations any team should keep in mind when they set out to handle big data.
What is big data?
In the world of data, scale is relative. Big data to some teams may be hundreds of records, while for others it may be billions or trillions. Depending on the records in question, each of these scenarios can present its own challenges. If the data you handle is consistently growing, even if it isn’t an impressive amount right now, you’ll need to be smart about how you handle it.
Define Your Process
First, understand your data and process.
What kind of data are you working with?
You could be:
- Querying a database with millions of rows
- Storing thousands of files
- Extracting information from thousands of files
- Processing a batch request of hundreds of records that each require a multi-step workflow
Each of these scenarios will have opportunities for optimization, but step one to having a good process up front is to recognize what kind of data you’ll be dealing with.
After this, you need to understand your process. Answer some questions such as:
- Is it (can it be) a background job?
- How quickly do you want to give a response to the user?
- Are you persisting information as a result of this job, or passing it straight through?
Then, more practically and technically, you’ll want to consider what kinds of operations this process will require:
- Will your operation require a lot of I/O (input/output) operations?
- Are you going to be performing large, time-consuming queries?
- Do you have to store a tremendous amount of data as a result of this operation?
- Are you going to require storing a lot of information in memory?
The answers to many of these questions will depend on the approach you take and will play a large part in how efficient your process is. For instance, a couple of core optimization rules are to limit your interactions with the database: insert many records at once, and establish a single connection that can be reused. However, this may mean that your process must hold more information in memory, and these are the types of tradeoffs you’ll have to consider when defining your process.
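As a sketch of the batched-insert idea, the helper below builds a single multi-row INSERT statement so that N records cost one database round trip instead of N. The table and column names are hypothetical, and the placeholders follow PostgreSQL’s $n style:

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert builds one INSERT statement covering `rows` records,
// so the whole batch goes to the database in a single statement.
func buildBatchInsert(table string, cols []string, rows int) string {
	groups := make([]string, 0, rows)
	arg := 1
	for i := 0; i < rows; i++ {
		ps := make([]string, len(cols))
		for j := range cols {
			ps[j] = fmt.Sprintf("$%d", arg)
			arg++
		}
		groups = append(groups, "("+strings.Join(ps, ", ")+")")
	}
	return fmt.Sprintf("INSERT INTO %s (%s) VALUES %s",
		table, strings.Join(cols, ", "), strings.Join(groups, ", "))
}

func main() {
	// With database/sql you would execute this once over a reused
	// connection pool, rather than once per record.
	fmt.Println(buildBatchInsert("events", []string{"id", "payload"}, 3))
}
```

The tradeoff mentioned above is visible here: the whole batch has to be assembled (and held) in memory before it is sent.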
Try to define each step of your process and break them into small, discrete operations that can be examined and run independently. Good system design encourages this with any development, but when you’re dealing with heavy loads it becomes even more important. This also gives you the opportunity to have specific tools or optimizations for the steps that may become your bottleneck.
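One way to sketch that decomposition in Go: give every step the same small signature, so each one can be tested, timed, and swapped out independently. The step names here are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// A step takes records in and passes records on. Keeping steps this
// small lets you examine and run each one on its own.
type step func(records []string) []string

// trimWhitespace is an illustrative cleanup step.
func trimWhitespace(records []string) []string {
	out := make([]string, len(records))
	for i, r := range records {
		out[i] = strings.TrimSpace(r)
	}
	return out
}

// dropEmpty is an illustrative filtering step.
func dropEmpty(records []string) []string {
	var out []string
	for _, r := range records {
		if r != "" {
			out = append(out, r)
		}
	}
	return out
}

// runPipeline chains the steps; a bottleneck step can later be
// replaced or parallelized without touching the others.
func runPipeline(records []string, steps ...step) []string {
	for _, s := range steps {
		records = s(records)
	}
	return records
}

func main() {
	fmt.Println(runPipeline([]string{" a ", "", "b"}, trimWhitespace, dropEmpty))
}
```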
Pick the right tools
Once you understand your process, it’s time to pick the tools you’ll use to perform these operations. Be open-minded when considering which language, framework, or platform may be the best fit. Some languages are better suited to specific operations, and even if you segment your codebase a little for the sake of efficiency, the tradeoff may be worth it!
If you find you have a process with multiple steps, consider treating each of them as an opportunity to pick the right tool for the job. If you use something like AWS Lambda, each step of your process can be handled in a different language. You could have Python, Node, Go, and C# in the same workflow. Not necessarily something we would recommend, but you should explore which language fits the step you’re working on!
Identify your bottlenecks
Every process will end up having a single step that takes the most time. Often this one step will be the most time-consuming by a factor of 10 or more. This can present an opportunity to make significant progress in improving your workflow. Why spend time optimizing a step that takes seconds, when you have another that takes minutes? Focus on the steps with a potentially high return on investment.
Another consideration when looking at your bottlenecks is whether you can organize your flow so that shorter processes run alongside your most time-consuming step. For instance, if your users are waiting on data validation, you may be able to respond immediately with the validation results and kick off a separate job to process the data.
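A sketch of that validate-then-defer pattern is below. Here validate is a hypothetical stand-in for the fast, user-facing check, and the goroutine stands in for what would usually be a queued background job:

```go
package main

import (
	"fmt"
	"sync"
)

// validate is the quick check the user is actually waiting on.
func validate(records []string) error {
	for _, r := range records {
		if r == "" {
			return fmt.Errorf("empty record")
		}
	}
	return nil
}

func main() {
	records := []string{"a", "b", "c"}

	// Respond to the user as soon as validation passes...
	if err := validate(records); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("accepted: processing started")

	// ...and run the slow processing separately. In a real service
	// this would be a queued job rather than a bare goroutine.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for range records {
			// placeholder for the expensive per-record work
		}
	}()
	wg.Wait()
}
```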
Choosing the right language
Softrams recently set a new benchmark for the number of records it was able to extract when it processed 4.26 billion records from almost 2,500 files in a single data extraction. The work is heavily I/O-bound and was originally written in NodeJS. By taking advantage of Go’s concurrency capabilities, we reduced the time to process a single file from almost 16 minutes to about 6 minutes. Over 2,500 files, this change saved more than 400 hours of processing time!
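A simplified sketch of that kind of fan-out is below: a fixed pool of goroutines pulls files off a channel so several are processed at once. Here processFile is a placeholder for the real extraction, and the worker count and file names are made up:

```go
package main

import (
	"fmt"
	"sync"
)

// processFile stands in for the expensive per-file extraction;
// here it just returns the length of the file name.
func processFile(name string) int {
	return len(name)
}

// processAll fans the files out to `workers` goroutines and
// collects a combined result.
func processAll(files []string, workers int) int {
	jobs := make(chan string)
	results := make(chan int, len(files))

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range jobs {
				results <- processFile(f)
			}
		}()
	}
	for _, f := range files {
		jobs <- f
	}
	close(jobs)
	wg.Wait()
	close(results)

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	files := []string{"a.dat", "bb.dat", "ccc.dat"}
	fmt.Println("total:", processAll(files, 2))
}
```

Because the work is I/O-bound, the pool size can comfortably exceed the number of CPU cores; tuning it is its own small experiment.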
Letting the database do the work
One of our engineers worked at a company that needed to process between 200,000 and 500,000 records every hour. Originally this job was written in PHP with a PostgreSQL database, and the process would load chunks of records into memory before processing them. This meant that it could only process a certain number of records at once, and the entire run took almost twenty minutes every hour. One engineer moved a core piece of processing logic into the query that fetched the records, which substantially reduced the computation required in memory. This cut the processing time from nearly 20 minutes to less than a minute per job.
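As a sketch of that kind of change (the transactions table, its columns, and the queries are hypothetical, not the actual queries from that job): instead of fetching raw rows and aggregating them in application memory, let the database do the aggregation so only a handful of summary rows ever reach the application.

```go
package main

import "fmt"

// Before: fetch every raw row and aggregate in application memory.
const fetchRaw = `SELECT account_id, amount FROM transactions`

// After: let PostgreSQL aggregate, returning one row per account.
const aggregateInDB = `SELECT account_id, SUM(amount) AS total
  FROM transactions
 GROUP BY account_id`

// sumInMemory is the work the application had to do with fetchRaw;
// the GROUP BY query makes this loop, and the memory it uses, disappear.
func sumInMemory(rows [][2]int) map[int]int {
	totals := make(map[int]int)
	for _, row := range rows {
		totals[row[0]] += row[1]
	}
	return totals
}

func main() {
	rows := [][2]int{{1, 10}, {1, 5}, {2, 7}}
	fmt.Println(sumInMemory(rows))
	fmt.Println(aggregateInDB)
	_ = fetchRaw
}
```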
Memory is cheap, but don’t limit your system by neglecting a sound approach when dealing with large data. Be intentional about your process and set yourself up for success to scale 10x when you need to.