Airflow


DAGs stuck in Running state

  • 1.  DAGs stuck in Running state

    Quboler
    Posted 04-03-2019 15:59
    Edited by Gokul Kumar 04-03-2019 16:26
    *This question was asked by a customer as part of a support ticket. Posting it here, as it might be useful for other folks in the community as well.*

    We have multiple DAGs in our Prod account that are stuck. One task has been stuck in the Running state for a long time, so no reruns of this task are being scheduled anymore.



    ------------------------------
    Gokul Kumar
    Qubole
    ------------------------------


  • 2.  RE: DAGs stuck in Running state

    Quboler
    Posted 04-03-2019 16:23
    The two Airflow components relevant here are:

    1. Airflow Webserver
    2. Airflow Scheduler

    The Airflow Scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. DAGs not being scheduled, or tasks being stuck, are generally due to an issue with the Scheduler.

    We found that the Airflow Scheduler in this case had crashed and all attempts by Monit to restart it had failed. Any attempt to restart the Scheduler manually resulted in the same crash, with the error below:
    File "/usr/lib/envs/env-1269-ver-0-a-4.2.9-py-3.5.3/lib/python3.5/site-packages/apache_airflow-1.8.2-py3.5.egg/airflow/jobs.py", line 1025, in _execute_task_instances
        open_slots = pools[pool].open_slots(session=session)
    KeyError: 'trip_report_pool'
    The error shows that the Scheduler failed because a DAG referred to an Airflow Pool, 'trip_report_pool', which the user confirmed no longer existed. After digging deeper, the user found a reference to this pool in one of the older DAGs, which was causing the failure.
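
    To illustrate the failure mode in the traceback, here is a minimal sketch. It is not Airflow's actual code: the Pool class, the pool name 'etl_pool', and both helper functions are hypothetical stand-ins. It only shows why indexing a dict of pools with a name that is no longer defined raises KeyError, and what one possible guard could look like.

    ```python
    # Simplified stand-in for a pool record; NOT the real Airflow Pool model.
    class Pool:
        def __init__(self, slots, used):
            self.slots = slots
            self.used = used

        def open_slots(self):
            return self.slots - self.used


    def open_slots_strict(pools, pool_name):
        # Same lookup pattern as in the traceback: a plain dict index,
        # which raises KeyError when pool_name is not a defined pool.
        return pools[pool_name].open_slots()


    def open_slots_guarded(pools, pool_name):
        # Hypothetical defensive variant: treat an unknown pool as having
        # no open slots instead of letting the exception propagate.
        pool = pools.get(pool_name)
        if pool is None:
            return 0
        return pool.open_slots()


    # 'etl_pool' is an assumed example; 'trip_report_pool' is deliberately absent.
    pools = {"etl_pool": Pool(slots=4, used=1)}

    try:
        open_slots_strict(pools, "trip_report_pool")
    except KeyError as exc:
        print("strict lookup failed with KeyError:", exc)

    print("guarded lookup:", open_slots_guarded(pools, "trip_report_pool"))
    ```

    The guarded variant is only one way such a lookup could be made tolerant; the actual fix for the bug may handle the missing pool differently.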

    The Scheduler failure was caused by an open source bug, [AIRFLOW-1157] Assigning a task to a pool that doesn't exist crashes the scheduler - ASF JIRA, which crashes the Scheduler whenever a task refers to a pool that does not exist. The issue was resolved after removing all references to 'trip_report_pool' across DAGs.

    The fix for this will be going out as part of the next release (R56).
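
    Finding leftover pool references by hand can be tedious when there are many DAG files. The following is a rough sketch of one way to search for them, assuming pool names are passed as string literals (pool='...') in the DAG source. The function names, the DAG folder path, and the set of existing pools are all hypothetical; the list of pools that actually exist has to come from your own Airflow setup.

    ```python
    import re
    from pathlib import Path

    # Matches keyword arguments such as pool='name' or pool="name".
    # Pool names built from variables or config are NOT detected.
    POOL_PATTERN = re.compile(r"""\bpool\s*=\s*['"]([^'"]+)['"]""")


    def find_pool_references(dags_dir):
        """Return {pool_name: [(file_path, line_number), ...]} for literal pool names."""
        references = {}
        for path in sorted(Path(dags_dir).rglob("*.py")):
            text = path.read_text(encoding="utf-8", errors="replace")
            for line_number, line in enumerate(text.splitlines(), start=1):
                for match in POOL_PATTERN.finditer(line):
                    hits = references.setdefault(match.group(1), [])
                    hits.append((str(path), line_number))
        return references


    def find_missing_pools(dags_dir, existing_pools):
        """Keep only references to pools that are not in existing_pools."""
        references = find_pool_references(dags_dir)
        return {
            name: hits
            for name, hits in references.items()
            if name not in existing_pools
        }
    ```

    For example, calling find_missing_pools("/path/to/dags", {"etl_pool"}) (both arguments are placeholders) would return every literal pool name other than 'etl_pool', along with the files and line numbers where it appears, so those references can be removed or the pool recreated.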

    Additional Details:
    • How to check whether the Airflow components (Webserver, Scheduler) are running?
      • Scheduler: Log in to the Airflow cluster master and run sudo monit status scheduler
      • Webserver: Log in to the Airflow cluster master and run sudo monit status webserver
    • How to restart the Scheduler or the Webserver?
      • Log in to the Airflow cluster master and run sudo monit restart scheduler or sudo monit restart webserver


    ------------------------------
    Gokul Kumar
    Qubole
    ------------------------------