Why your solution needs data pipelines

Do you want to scale up your solution? Or get the most out of cloud capabilities? Consider using data pipelines. The majority of the work in building a new AI solution is often in the development phase. The hard work appears to be done after designing algorithms, writing code and training AI models. Right until you need to process thousands of gigabytes of data. Unless you have a scalable pipeline in place, managing such a task can be a pain.

What is a pipeline?

Most AI solutions do not consist of a single step. Data needs to be pre- and post-processed before real value can be added. A best practice is to split up these processes during development. This offers opportunities to reduce processing time and costs, and allows for scalability. To make the most of these opportunities, one can use a pipeline.

A pipeline is an architecture that links multiple processes together. The design relies on an orchestrator and workers:

  • The orchestrator administers the work. It keeps track of all tasks and assigns them to the relevant workers. The orchestrator is a lightweight process, unlike the workers.
  • A worker, as the name suggests, handles the actual work. It executes tasks on demand. The hardware of a worker can be tailored to the needs of the specific task it runs. Workers running deep learning models can be equipped with GPUs, for example. Those performing smaller tasks can be equipped with far fewer resources.

Together, the orchestrator and workers can handle complex workflows. One of the impressive capabilities of a pipeline is branching one task into multiple tasks, executing them in parallel and merging the results back together.
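The branch-and-merge pattern can be illustrated with a minimal Python sketch. It assumes a thread pool stands in for the workers and a plain function stands in for the orchestrator. All task names (preprocess, model_a, model_b, merge) are hypothetical placeholders, not part of any real pipeline product.

```python
from concurrent.futures import ThreadPoolExecutor

# All function names below are hypothetical placeholders for real tasks.
def preprocess(value):
    return value * 2

def model_a(value):
    return value + 1

def model_b(value):
    return value - 1

def merge(result_a, result_b):
    return {"model_a": result_a, "model_b": result_b}

def orchestrate(value, workers):
    """Run one task, branch into two parallel tasks, then merge the results."""
    prepared = preprocess(value)
    # Branch: both tasks receive the same input and run side by side.
    future_a = workers.submit(model_a, prepared)
    future_b = workers.submit(model_b, prepared)
    # Merge: wait for both branches and combine their output.
    return merge(future_a.result(), future_b.result())

with ThreadPoolExecutor(max_workers=2) as workers:
    result = orchestrate(10, workers)

print(result)
```

In a production pipeline the workers would typically be separate machines or containers rather than threads, but the division of roles stays the same.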

Why use a pipeline?

The main reason for implementing a pipeline is scalability. The real benefit of using an orchestrator lies in running work in parallel instead of sequentially. With parallel computing, a growing workload can be handled by adding workers rather than by waiting longer.

Processes can be run in parallel in different ways. First and foremost, by splitting up data into smaller chunks the processing time of each chunk can be reduced. Running some or all of these chunks at the same time drastically reduces the time needed to process an entire dataset. Take for example image classification. Instead of processing everything at once, images can also be processed in batches, in parallel. Secondly, when different types of processes require the same input data, why not run them in parallel? When two models need to process the same image, they should not have to wait for one another.
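Splitting a dataset into batches and processing them in parallel could look roughly like the sketch below. It is an illustration under assumptions: each "image" is reduced to a single brightness value, classify_batch is a hypothetical stand-in for a real model, and threads are used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def classify_batch(batch):
    # Hypothetical stand-in for a model: label each brightness value.
    return ["bright" if value >= 128 else "dark" for value in batch]

def classify_all(images, batch_size, max_workers=4):
    """Classify all images by processing fixed-size batches in parallel."""
    batches = list(chunks(images, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the batch results in the same order as the input.
        batch_results = list(pool.map(classify_batch, batches))
    # Flatten the per-batch results back into one list.
    return [label for labels in batch_results for label in labels]

labels = classify_all([0, 200, 128, 50, 255], batch_size=2)
print(labels)
```

For CPU-heavy work in Python, a process pool or separate worker machines would usually be a better fit than threads. The chunking logic itself does not change.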

Complex AI solutions thrive on well-implemented parallel processing pipelines. When parallel processing is employed to its fullest potential, the total run time for an entire dataset can be reduced to the duration of the longest sequential task chain. Keep in mind that there are limitations to this, such as the availability of computing power and the depth of your pockets. Nevertheless, significant time savings can be achieved.
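To make the "longest sequential task chain" concrete, here is a small Python sketch that compares sequential run time with the best-case parallel run time. The task names, durations and dependencies are made-up numbers for illustration, and the calculation assumes unlimited workers and no scheduling overhead.

```python
from functools import lru_cache

# Hypothetical task durations in minutes.
DURATION = {"preprocess": 5, "model_a": 20, "model_b": 8, "merge": 2}

# Hypothetical dependencies: each task starts after the listed tasks finish.
DEPENDS_ON = {
    "preprocess": [],
    "model_a": ["preprocess"],
    "model_b": ["preprocess"],
    "merge": ["model_a", "model_b"],
}

@lru_cache(maxsize=None)
def earliest_finish(task):
    """Earliest time a task can finish when enough workers are available."""
    start = max((earliest_finish(dep) for dep in DEPENDS_ON[task]), default=0)
    return start + DURATION[task]

# Sequential: every task waits for the previous one.
sequential_time = sum(DURATION.values())

# Fully parallel: bounded by the longest chain of dependent tasks.
parallel_time = max(earliest_finish(task) for task in DURATION)

print(sequential_time, parallel_time)
```

With these numbers the longest chain is preprocess, model_a, merge, so the shorter model_b no longer adds to the total run time.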

Cloud computing introduces a second advantage when implementing pipelines: cost reduction. On-demand compute allows for high utilization when demand is large. When there is no work to be done, no costs are incurred. This means a pipeline can be scaled up without introducing significant additional costs compared to sequential processing: the same amount of work is spread over more machines for a shorter time. Additionally, tuning the hardware of worker machines to their workload can maximize efficiency. When not all processes are equally intensive, the use of expensive resources should be minimized. Tailoring each worker node to a specific process minimizes the time a worker is underutilized.
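The cost argument can be checked with a bit of arithmetic. The Python sketch below assumes pay-per-use billing, perfectly divisible work and no start-up or coordination overhead. The price and workload are invented numbers, and real cloud bills will deviate from this ideal.

```python
def on_demand_cost(total_work_hours, workers, price_per_worker_hour):
    """Return (wall-clock hours, total cost) under ideal on-demand billing."""
    wall_clock_hours = total_work_hours / workers
    cost = workers * wall_clock_hours * price_per_worker_hour
    return wall_clock_hours, cost

# Hypothetical workload: 100 hours of compute at 2.0 per worker-hour.
sequential = on_demand_cost(100, workers=1, price_per_worker_hour=2.0)
parallel = on_demand_cost(100, workers=10, price_per_worker_hour=2.0)

print(sequential)
print(parallel)
```

Under these assumptions ten workers finish in a tenth of the time for the same total cost, because idle machines are not billed.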

Another advantage is job monitoring. Monitoring the status of a running pipeline can give valuable insights. It allows you to predict the remaining processing time and identify relatively expensive tasks. Since all failed processes are logged, bugs are easy to track down and isolate. After fixing them, only the tasks that failed need to be executed again. Data that already passed through the pipeline successfully can be skipped.
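Rerunning only the failed tasks can be sketched in a few lines of Python. This is a simplified illustration, assuming results and failures are kept in in-memory dictionaries. The functions run_pipeline, buggy_step and fixed_step are hypothetical names, and a real orchestrator would persist this state.

```python
def run_pipeline(items, process, done=None):
    """Process items, skipping those in `done`; return (done, failed)."""
    done = dict(done or {})
    failed = {}
    for item in items:
        if item in done:
            continue  # Already processed successfully in an earlier run.
        try:
            done[item] = process(item)
        except Exception as error:
            failed[item] = repr(error)  # Log the failure for later inspection.
    return done, failed

def buggy_step(x):
    if x == 3:
        raise ValueError("cannot handle 3")
    return x * x

rerun_calls = []

def fixed_step(x):
    rerun_calls.append(x)  # Record which items are actually reprocessed.
    return x * x

items = [1, 2, 3, 4]
first_done, first_failed = run_pipeline(items, buggy_step)
final_done, final_failed = run_pipeline(items, fixed_step, done=first_done)

print(sorted(first_failed), rerun_calls)
```

After the fix, only the item that failed in the first run is processed again. Everything that already succeeded is skipped.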

Many of the products we work on rely on scalable pipelines. Everything from detecting solar panels on roofs to inspecting trees and guard rails with LiDAR data benefits from these technologies. Our responsibility is to enable scalability and more-than-rapid execution of critical business processes for our customers and users. Check the software page for more information.

Skyrocket your application?

Niek studied to build rockets, now he builds awesome AI software. Take off with him via n.vandenbos@sobolt.com
