You have a troubled caller on one hand and a system on the other. Not every day is a Sunday, and not every job on the mainframe will run for XX hours. We should be prepared to expect the unexpected. Coming back to the troubled caller from the application team complaining about a runaway job, an easy way to calm him or her down would be the following.
This assumes you have a good DB2 admin tool at your disposal, from CA, IBM, BMC or any other vendor.
1) Gather the job name and other details from the application team member on the call: job JBAPLN01, subsystem PRDN, running for more than 13 hours
2) Search the job for the plan/program it executes: PROG040
3) Go to the tool at your disposal (Query Monitor in my case) and check the number of SQL calls made by the program
4) Make observations:
Last run: the job completed in 1 hour with 4M SQL calls
Today: the job is still running after more than 13 hours, with 14M SQL calls so far on APLN_TABL001
What is causing a roughly 3.5 times increase in the number of calls made by this program?
5) Formulate possible reasons (I have given my findings as well):
Code changes in the program? The application team confirmed that there were none
Bad SQL query? Checked anyway, even though a bad access path would not by itself increase the number of calls. Ruled out by running EXPLAIN on the query
Number of rows in the tables being accessed? A drastic increase, from 2M to 212M rows
The application team verified the cause and took preventive measures by rolling back a code change in another job that populates rows into APLN_TABL001.
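The comparison in steps 4 and 5 is plain arithmetic, so it can be written down as a small script. What follows is only a sketch in Python, not something produced by Query Monitor or any other tool: RunStats, growth, growth_ratios and the 2x threshold are hypothetical names and values chosen for illustration. The figures are the ones quoted above, with two assumptions: today's elapsed time is taken as 13 hours even though the job was still running, and the 2M row count is taken to be the table size at the time of the last good run.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunStats:
    """Figures noted for one run of the job (hypothetical container)."""
    label: str
    elapsed_hours: float
    sql_calls: int
    table_rows: int


def growth(before: float, after: float) -> float:
    """Return how many times larger 'after' is than 'before'."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return after / before


def growth_ratios(before: RunStats, after: RunStats) -> Dict[str, float]:
    """Growth factor of every metric between two runs."""
    return {
        "elapsed_hours": growth(before.elapsed_hours, after.elapsed_hours),
        "sql_calls": growth(before.sql_calls, after.sql_calls),
        "table_rows": growth(before.table_rows, after.table_rows),
    }


# Figures from the call. Assumptions: 13 hours is a lower bound because the
# job was still running, and 2M rows is the table size at the last good run.
last_run = RunStats("last run", 1.0, 4_000_000, 2_000_000)
today = RunStats("today", 13.0, 14_000_000, 212_000_000)

ratios = growth_ratios(last_run, today)

# Arbitrary illustrative threshold: anything that at least doubled is suspect.
THRESHOLD = 2.0
suspects = {name: r for name, r in ratios.items() if r >= THRESHOLD}

# The metric that grew the most is the first place to look.
top_suspect = max(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
print("largest growth:", top_suspect)
```

With these numbers the SQL calls grew 3.5 times while the row count grew 106 times, which is why the table volume, and not the program or its access path, ends up as the prime suspect.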
All of the above takes less than 20 minutes, and the mystery is solved. SEV 2? Tell me about it, or just avoid it!