To curate a collection is to be an editor, determining what to include and what to exclude. From a curated selection of tools, I expect to see a selection of the best tools, chosen by a knowledgeable curator who has evaluated them in some way. So, for example, if you said "I want to start a library, donate any books you have at my house," that would not be a "curated" collection in my opinion.

The art of software engineering is all about finding the right abstractions. Higher-order abstractions can be a productivity boon, but they have costs when you fight their paradigm or need to regularly interact with lower layers (in ways the designs didn't presume).

Airflow and similar tools are doing four things:

a) Centralized cron for distributed systems. If you don't have a unified runtime for your system, the old ways of using Unix cron or a "job system" become complex, because you don't have centralized management or clarity about when developers should use one scheduling tool versus another.

b) Job state management. Jobs can fail and may need to be retried, people alerted, and so on. Most scheduling systems have some way to deal with failure too, but these tools treat it as stored state.

c) DAGs. Complex batch jobs are often composed of many stages with dependencies, and you need the state to track and retry stages independently (especially if they are costly).

d) Tying the computation that performs a given job to the scheduling tool. These tools are a mix of wrappers that talk to different compute systems and actual compute mechanisms themselves. They also try to provide "premade" job stages, or "operators," for common tasks.

If you have the kind of system that is either sufficiently distributed or heterogeneous enough that you can't use existing schedulers, you need something with #A; if you also need complex job management, you need #A, #B, and #C. Having rebuilt my own many times, I've found that a standard system is better when coordinating between many engineers.

I don't think it's particularly clear, however, when to use Airflow.

The single best reason to use Airflow is that you have some data source with a time-based axis that you want to transfer or process. For example, you might want to ingest daily web logs into a database, or generate weekly statistics on your database.

The next best reason to use Airflow is that you have a recurring job that you want not only to happen, but whose successes and failures you want to track. For example, maybe you want to garbage-collect some files on a remote server with spotty connectivity, and you want to be emailed if it fails for more than two days in a row.

Beyond those two, Airflow might be very useful, but you'll be shoehorning your use case into its capabilities.

Airflow is basically a distributed cron daemon with support for reruns and SLAs. If you're using Python for your tasks, it also includes a large collection of data abstraction layers, so that Airflow manages the named connections to the different sources and you only have to code the transfer or transform rules.
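
To make the "ingest daily web logs into a database" case concrete, here is a minimal sketch of a two-stage DAG, assuming Airflow 2.x-style imports; the dag_id, task names, and the extract/load functions are hypothetical placeholders rather than anything described in the comments above.

```python
# Minimal sketch only: assumes Airflow 2.x; the dag_id, task ids, and the
# extract/load functions are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_logs(**context):
    # Placeholder: fetch the web logs for the run's logical date.
    # Airflow passes the logical date string as context["ds"].
    print(f"extracting logs for {context['ds']}")


def load_to_db(**context):
    # Placeholder: write the parsed records into the database.
    print(f"loading logs for {context['ds']}")


with DAG(
    dag_id="daily_weblog_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",             # the "centralized cron" part
    catchup=False,
    default_args={
        "retries": 2,                       # each stage retried independently
        "retry_delay": timedelta(minutes=10),
    },
) as dag:
    extract = PythonOperator(task_id="extract_logs", python_callable=extract_logs)
    load = PythonOperator(task_id="load_to_db", python_callable=load_to_db)

    extract >> load                         # the dependency edge of the DAG
```

The schedule, the per-stage retries, and the dependency edge are all declared in the one DAG file, which is roughly what replaces a crontab entry plus ad-hoc retry scripts.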
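The garbage-collection example maps onto Airflow's retry, failure-email, and SLA settings roughly as follows. This is a sketch under the assumption that an SMTP server is configured for Airflow's email support; the host, path, and address below are invented.

```python
# Minimal sketch only: assumes Airflow 2.x and SMTP configured for Airflow's
# email support; the host, path, and address are made up.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="remote_gc",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 3,                        # reruns for the spotty connection
        "retry_delay": timedelta(minutes=30),
        "email": ["oncall@example.com"],
        "email_on_failure": True,            # mail out when a run finally fails
        "sla": timedelta(days=2),            # flag the task if it hasn't succeeded
                                             # within two days of its scheduled time
    },
) as dag:
    BashOperator(
        task_id="gc_old_files",
        bash_command="ssh gc@files.example.com 'find /var/tmp/exports -mtime +30 -delete'",
    )
```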
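And the "named connections" point about data abstraction layers might look like this with a hook. This assumes the apache-airflow-providers-postgres package is installed and that a connection named "warehouse" has been defined in the Airflow UI or via `airflow connections add`; the connection id and SQL are illustrative.

```python
# Minimal sketch only: the "warehouse" connection and the query are made up.
from airflow.providers.postgres.hooks.postgres import PostgresHook


def weekly_pageview_count():
    # The hook resolves host and credentials from the stored connection, so
    # the task code contains only the transfer/transform logic.
    hook = PostgresHook(postgres_conn_id="warehouse")
    rows = hook.get_records(
        "SELECT count(*) FROM page_views WHERE ts > now() - interval '7 days'"
    )
    print(rows)
```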