Description
There is a huge lag on categories on Commons.
See https://commons.wikimedia.org/wiki/Commons:Village_pump#Cat-a-lot_not_working_well for details.
Related Objects
- Mentioned In
- T178840: Commons category updates are very slow. Commons job queue very large.
- Mentioned Here
- rMWa6262272c966: Properly make LinksUpdate be recursive when done from job queue
- rMWfe323f9b68bb: Make LinksUpdate job spec use "refreshLinksPrioritized"
- rMW3339dec7c227: Make LinksUpdate job spec use "refreshLinksPrioritized"
- rMW1846e2dc15e9: Made LinksUpdate on edit use the job queue
- rMW2a1e1d7dd88a: Push post-edit RefreshLinks jobs into 'refreshLinksPrioritized'
- rMW7a4348640e09: Make triggerOpportunisticLinksUpdate() directly use RefreshLinks
- rMW5b26e083b080: Push post-edit RefreshLinks jobs into 'refreshLinksPrioritized'
Event Timeline
Quoting Nemo from IRC:
maybe the jobqueue is more backlogged than usual; about 1 million now https://commons.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics
Worth investigating; setting high priority.
I just removed some images from a category (by editing the file page). On the image page the category no longer shows up, but when I look at the category, the images are still listed there. No amount of purging and reloading the category helps. Please see for yourself: New_uploads_without_a_license should be empty, but shows 10 images at the time of this message.
Is this kind of problem something that could be made visible in http://status.wikimedia.org? Everything currently tracked there shows up as green (service operating normally).
https://wikiapiary.com/wiki/Wikimedia_Commons shows a pretty large job queue spike.
legoktm@terbium:~$ mwscript showJobs.php --wiki=commonswiki --group
refreshLinks: 50202 queued; 87963 claimed (12679 active, 75284 abandoned); 0 delayed
deleteLinks: 0 queued; 9334 claimed (0 active, 9334 abandoned); 0 delayed
htmlCacheUpdate: 4 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
recentChangesUpdate: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enqueue: 1268565 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cirrusSearchIncomingLinkCount: 0 queued; 0 claimed (0 active, 0 abandoned); 346 delayed
cirrusSearchLinksUpdatePrioritized: 0 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
gwtoolsetUploadMediafileJob: 0 queued; 194 claimed (5 active, 189 abandoned); 0 delayed
gwtoolsetUploadMetadataJob: 0 queued; 0 claimed (0 active, 0 abandoned); 3 delayed
RestbaseUpdateJobOnDependencyChange: 3793 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
webVideoTranscode: 3 queued; 115 claimed (58 active, 57 abandoned); 0 delayed
tailing runJobs.log on fluorine shows commonswiki jobs are being executed.
JobQueueRedis.log has a bunch of "Could not acknowledge refreshLinksPrioritized" type entries.
redis.log has some less frequent connection error messages.
jynus also said there has been an increase in mysql load since 12:23 PM UTC.
And according to wikiapiary, the enwp job queue is also increasing (didn't check any other ones).
Fixed by Aaron with:
- rMW5b26e083b080: Push post-edit RefreshLinks jobs into 'refreshLinksPrioritized'
- rMW7a4348640e09: Make triggerOpportunisticLinksUpdate() directly use RefreshLinks
There is still a large backlog of jobs, but it is shrinking.
Keeping this open until the queue is back to a manageable size.
What are the actionables here: why did this happen, and how can we avoid or detect this in the future? Was it caused by a specific code deploy or by external factors? Should we focus on better queue monitoring? Let's create tasks for the pending work.
Per jcrespo, this failure is worth a more in-depth analysis than the normal operational bug fix. It has had widespread effects, discouraging contributors from categorizing images and making it impossible for some long-running maintenance tasks to continue.
If more budget would help ensure this type of operational failure can be avoided in the future, then an understandable report will help make the case for prioritizing money in this area.
The number of jobs stopped growing, but isn't exactly shrinking either.
(I'm looking at a graph here: https://wikiapiary.com/wiki/Wikimedia_Commons choose "Jobs" from the dropdown under the graphs. Do we have our own graphs for this somewhere?)
IIUC, the length of the job queue can be obtained via the API:
{ action: "query", meta: "siteinfo", siprop: "statistics" }
which, as of "now", shows this:
"query": { "statistics": { "pages": 40360963, "articles": 121114, "edits": 200884207, "images": 28731743, "users": 5257152, "activeusers": 31654, "admins": 249, "jobs": 2062146, "queued-massmessages": 0 } }
(and indeed, in repeated calls, "jobs" fluctuates around 2M, not showing significant shrinkage; a small polling sketch follows this comment)
I would like to suggest that, in addition to the job count, the statistics also report the "fall-through" time of the queue.
Maybe this fall-through time is difficult or "impossible" to obtain, and maybe it's actually easy; I do not know.
IMO, absolute precision is not important, and assuming each item on the queue has a creation timestamp, I think it's enough to report the age of the last item removed from the queue, measured at the time of removal.
Maybe the queue is not 100% time-ordered, so this information may not represent the "fall-through time" exactly, but I think it's as useful as, or even more useful than, the "queue length" datum.
peace.
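For anyone wanting to watch this trend over time, here is a minimal polling sketch against the siteinfo statistics API quoted above. The endpoint and the "jobs" field come from the JSON shown; the polling interval and script structure are just illustrative, and the figure fluctuates, as noted above.

```python
import time
import requests

API = "https://commons.wikimedia.org/w/api.php"

def job_count():
    """Return the current 'jobs' figure from the siteinfo statistics."""
    r = requests.get(API, params={
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    })
    r.raise_for_status()
    return r.json()["query"]["statistics"]["jobs"]

# Poll every 10 minutes and print the trend, to see whether the backlog
# is actually shrinking rather than just fluctuating.
while True:
    print(time.strftime("%H:%M:%S"), job_count())
    time.sleep(600)
```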
So just to make sure I got this right: is this what happened?
- (Deployed to Commons Oct 14): 1846e2dc15e957c557efad7 - Makes the default deferred-update mode enqueue a job instead of running the update immediately at the end of the request. This means that when you edit a page, the category changes happen (or are supposed to happen) via the refreshLinksPrioritized queue (a rough sketch of this change follows the timeline).
- (Oct 20 03:30) An edit is made to an extremely high-use template https://commons.wikimedia.org/w/index.php?title=Template:Information&curid=576289&diff=176222206&oldid=153048707
- Based on graphs, fairly immediately after that edit, the job queue on Commons increases rapidly.
- Normally, this would have triggered a large number of recursive jobs. But recursive refreshLinks jobs are broken. So what happened instead is:
- htmlCacheUpdate updates page_touched
- as people view these pages, triggerOpportunisticLinksUpdate() is hit, since all these pages have {{CURRENTYEAR}} on them ({{information}} uses it to determine whether a file is > 200 years old)
- RefreshLinksJob gets queued up as people view these pages.
- (Oct 20 09:00) Users start reporting that categories they add directly to pages don't work
- 21:11 logmsgbot: ori@tin Synchronized php-1.27.0-wmf.2/includes/page/WikiPage.php: I5d0440588d (7a4348640e09bea): Make triggerOpportunisticLinksUpdate() directly use RefreshLinks (T116001)
- If we happen to parse a page just randomly that we need to refresh links on, and it happens to be cascade-restricted, this causes us to move it to the high-priority queue (the prioritize parameter wasn't recognized previously because it was using the JobSpecification class, not the RefreshLinksJob class), since we don't need to reparse and can do the job quickly. It also removes one layer of indirection. I'm not sure how much this would help,
as the enqueued job is missing the rootJobTimestamp, so it won't detect that the opportunistic updates already did the work. And most things aren't cascade-protected, so this seems kind of irrelevant. But I quite likely could be missing something here. [Update: The removal of one layer of indirection (which applies to both the cascade-protected and non-cascade-protected opportunistic paths) may have made a difference, as most of the jobs were from this code path]
- Growth of the job queue starts slowing down about here; at this point it's at 2.1 million jobs
- 23:25 logmsgbot: aaron@tin Synchronized php-1.27.0-wmf.3/includes/deferred: rMW2a1e1d7dd88a
- I find the commit message on this patch very confusing. As far as I can tell, the point is to set rootJobTimestamp, so that in many cases the job can know it can safely use an old version from the parser cache instead of having to re-parse the page. This should make those refreshLinks jobs run much quicker and allow duplicate jobs to be removed.
- Based on the commit message, I think the intention was to also change the job name to refreshLinksPrioritized so it would be in the right queue, but that is not part of the patch (?) [UPDATE: Looks like that part was accidentally lost between PS3 and PS4. See 3339dec7c227]
- Oct 21 02:00: Job queue hits a peak of 2.2 million jobs. Starts decreasing very slowly, but mostly holds steady at about 2 million jobs
- Oct 21 20:16 AaronSchulz: Started running 8 threads of commonswiki refreshlinks jobs on terbium
- Job queue length starts decreasing rapidly at this point (at a rate of about 100,000 jobs an hour)
- Oct 22 04:00: Job queue continues to decrease rapidly. Currently about 1 million entries on Commons
- [Update since original post] Oct 22 05:10 logmsgbot: aaron@tin Synchronized php-1.27.0-wmf.3/includes/deferred/LinksUpdate.php: rMWfe323f9b68bb - Direct category additions now go in the correct queue
- Commons users report that direct category additions now happen immediately. People are happier. Queue length ~800,000
- Oct 22 21:05 logmsgbot: aaron@tin Synchronized php-1.27.0-wmf.3/includes: rMWa6262272c966 (duration: 00m 23s) - Fixes the lack of recursiveness in refreshLinks jobs
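For readers less familiar with the deferred-update change in the first timeline entry, here is a rough, hypothetical before/after sketch. The class and function names are stand-ins for illustration, not the actual MediaWiki code.

```python
# Hypothetical sketch of the "enqueue instead of run at end of request" change.
class LinksUpdate:
    def __init__(self, title):
        self.title = title

    def do_update(self):
        # Old behaviour: links/categories were rewritten before the request finished.
        print("updating links/categories for", self.title)

    def as_job_spec(self):
        # New behaviour: the update is serialized into a job and pushed onto the
        # queue, so the category change only becomes visible once a job runner
        # actually processes it.
        return {"type": "refreshLinksPrioritized", "params": {"title": self.title}}

job_queue = []

def do_post_edit_updates(update, enqueue=True):
    if enqueue:
        job_queue.append(update.as_job_spec())  # new default
    else:
        update.do_update()                      # old default: run immediately

do_post_edit_updates(LinksUpdate("File:Example.jpg"))
print(job_queue)
```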
In summary:
- Direct category additions were changed to use the job queue
- A template with about 20 million backlinks was edited
- Everything after that point got stuck behind the massive job, possibly due to a bug where the deferred update was added to the refreshLinks queue instead of refreshLinksPrioritized
- A missing rootJobTimestamp prevented some de-duplication from occurring and from using the parser cache when possible, which exacerbated the problem
Also note:
When prioritized jobs just used the prioritize parameter and not the full refreshLinksPrioritized job type, the ack() method failed because it looked in the prioritized queue while the job was actually in the regular queue. This triggered another 2 retries, which wasted more time.
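A much-simplified, hypothetical model of that ack() mismatch; the queue layout and function names here are invented for the example, not the actual JobQueueRedis code.

```python
# Simplified model: one queue per job type name.
queues = {"refreshLinks": {}, "refreshLinksPrioritized": {}}

def push(job):
    # The job is enqueued under its plain type name, even though it
    # carries a 'prioritize' parameter.
    queues[job["type"]][job["id"]] = job

def ack(job):
    # ack() derives the queue name from the prioritize flag instead, so it
    # looks in 'refreshLinksPrioritized' while the job sits in 'refreshLinks'.
    name = "refreshLinksPrioritized" if job["params"].get("prioritize") else job["type"]
    if job["id"] not in queues[name]:
        # This corresponds to the "Could not acknowledge refreshLinksPrioritized"
        # log entries; the unacknowledged job is later retried, wasting more work.
        raise KeyError("Could not acknowledge %s job %s" % (name, job["id"]))
    del queues[name][job["id"]]

job = {"id": "abc123", "type": "refreshLinks", "params": {"prioritize": True}}
push(job)
try:
    ack(job)
except KeyError as e:
    print(e)  # the job stays unacknowledged and gets retried
```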
Thanks for spelling all that out. I had no idea the job queues on Commons could run up that high, that fast. Fascinating stuff and thanks for fixing the problem.
Bawolff, thank you for the clear explanation of the events. It seems like my poorly timed edit to the Information template started it all. Looking at https://commons.wikimedia.org/wiki/Special:MostTranscludedPages, the Information template is actually the #13 most transcluded page, and we have 32 templates/modules with more than 5 million transclusions and 113 with over 1 million. As the number of files on Commons seems to follow a quadratic curve (see https://commons.wikimedia.org/wiki/Commons:Milestones), we expect a lot more files in the years to come, and templates/modules with a lot more transclusions. If the trend continues, the number of files will double in 3 years. Many of the most transcluded Commons pages deal with translations (see #7, #12 or #26) and are expected to change each time someone wants to alter text in one of the hundreds of languages. Are these kinds of slow-downs expected to become more common then?
Hmm. So at least locally, recursive jobs for refreshLinks are not being triggered (see https://gerrit.wikimedia.org/r/#/c/248119/ for fix). But if recursive jobs are not being triggered in production, then I'm rather confused as to what caused the job queue to explode on commons if not the recursive refreshLinks jobs from editing {{information}}.
Many of the jobs I saw getting run were from triggerOpportunisticLinksUpdate().
Ah. {{information}} contains {{Years since}}, which calls {{CURRENTYEAR}}. So almost all pages on Commons are considered to have dynamic content. After the edit to {{information}}, htmlCacheUpdate updated page_touched; then the opportunistic LinksUpdate slowly started updating all pages. Thus the LinksUpdate essentially ended up being recursive, despite the bug preventing it from actually being recursive.
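A rough sketch of that chain, purely to illustrate the reasoning; the field names, threshold, and helper are stand-ins, not the actual WikiPage::triggerOpportunisticLinksUpdate() implementation.

```python
from dataclasses import dataclass

DYNAMIC_EXPIRY_THRESHOLD = 86400  # illustrative cutoff (seconds) for "dynamic" output

@dataclass
class Page:
    title: str
    page_touched: str        # MediaWiki-style timestamps, e.g. "20151020033000"
    page_links_updated: str

def enqueue_refresh_links_job(title):
    print("queued refreshLinks job for", title)  # stand-in for the real job push

def on_page_view(page, output_cache_expiry):
    # {{CURRENTYEAR}} (via {{Years since}} inside {{Information}}) gives the
    # parser output a short cache expiry, so nearly every file page counts as
    # "dynamic" and reaches this code path when viewed.
    if output_cache_expiry >= DYNAMIC_EXPIRY_THRESHOLD:
        return
    # The edit to {{Information}} made htmlCacheUpdate bump page_touched on
    # every transcluding page, so this comparison holds the next time each
    # page is viewed, and a RefreshLinks job gets queued for it.
    if page.page_touched > page.page_links_updated:
        enqueue_refresh_links_job(page.title)

on_page_view(Page("File:Example.jpg", "20151020040000", "20151019120000"), 3600)
```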
An additional thing I noticed:
For a direct category add (adding [[Category:Foo]] to a page), prior to this change we would use the ParserOutput just generated by the edit. Now, however, due to the clock-skew allowance from self::CLOCK_FUDGE, we are probably going to be re-parsing the page (unless it takes more than 10 seconds to get to the job).
This re-parsing is probably slowing these jobs down. If the job is from a direct edit action, a race condition seems less likely to be a concern. (Not 100% sure on that. If it is a concern, maybe we could put a random number into the parser cache and then match against it if the timestamp is within the CLOCK_FUDGE radius.)
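A tiny sketch of the reuse check being described, assuming the comparison works roughly like this (the real RefreshLinksJob logic may differ in detail); it shows why a job processed within ~10 seconds of the edit ends up re-parsing.

```python
CLOCK_FUDGE = 10  # seconds of cross-server clock skew the job allows for

def can_reuse_cached_output(cached_output_ts, revision_ts):
    # The cached ParserOutput is only trusted if it was rendered at least
    # CLOCK_FUDGE seconds after the revision; otherwise clock skew could mean
    # it actually predates the edit. A job that runs right after the edit sees
    # output only a second or two newer, fails this test, and re-parses.
    return cached_output_ts - CLOCK_FUDGE >= revision_ts

print(can_reuse_cached_output(1445478756, 1445478755))  # False: re-parse
print(can_reuse_cached_output(1445478775, 1445478755))  # True: reuse the cache
```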
I am seeing similar behavior on en.WP; this problem is not limited to Commons.
[added a couple hours later:] I have noticed this lag recently, just in the past couple of weeks, as I regularly clean out Category:Pages with citations using unsupported parameters and similar tracking categories. About one out of every twenty articles will remain in the category after I make a change that should remove it from the category. When I visit the article, the tracking category is not listed at the bottom, but until I do a null edit, the article stays on the category page. This is something new that has never happened to me in tens of thousands of similar edits over the past couple of years.
At this writing, you can see the articles "Fred McBrien" and "Padikkadavan (2009 film)" in the "unsupported parameters" category even though the unsupported parameters have been fixed.
I think there might be a race condition between the refreshLinks job and fetching from external storage.
So, for the Fred McBrien article: the @Jonesey95 ([[User:Abebenjoe]], presumably) edit that got rid of the category was:
rev_comment | rev_timestamp | rev_user_text |
/* top */ Fixing style/layout errors | 20151029015235 | Abebenjoe |
And the categorylinks / page entries look like:
*************************** 3. row ***************************
cl_from: 13526177
cl_to: Pages_with_citations_using_unsupported_parameters
cl_sortkey: FRED MCBRIEN
cl_timestamp: 2015-10-29 01:50:42
cl_sortkey_prefix:
cl_collation: uppercase
cl_type: page
page_id: 13526177
page_namespace: 0
page_title: Fred_McBrien
page_restrictions:
page_counter: 0
page_is_redirect: 0
page_is_new: 0
page_random: 0.940205091976
page_touched: 20151029015235
page_links_updated: 20151029015236
page_latest: 688010736
page_len: 5959
page_content_model: wikitext
So the edit happened Oct 29 01:52:35. As of right now, page_touched is also 01:52:35, and page_links_updated is 1 second later at 01:52:36. So the refreshLinks job did run successfully. However, the page is clearly still in the categorylinks table, despite the fact that the parser output from 01:52:35 does not have it in that category.
This is close enough to the self::CLOCK_FUDGE 10-second window that the cached parser output was probably not used. Thus the page text was fetched via $revision->getContent( Revision::RAW ). However, external storage can also have slave lag, so maybe $revision->getContent() didn't actually get the right version of the page.
There's code to wait for slaves ( wfGetLB()->waitFor( $this->params['masterPos'] ); ), but that's only for the main DB. Maybe we should do something similar for the external storage cluster.
ExternalStore uses immutable blobs, so either the right object is found or nothing is found at all.
A large increase in run rate and a sharp overall queue size decrease started after https://gerrit.wikimedia.org/r/#/c/249687/ was deployed. I'm waiting to see how low the queue size gets.
From https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=6&fullscreen we went from 3.25M to 72K jobs. Seems to have leveled out.
Oh, right.
Given that masterPos wouldn't be set from LinksUpdate::getAsJobSpecification(), I guess a race condition from normal slave lag could account for the behaviour that @Jonesey95 saw.
I had a look at the job queue at https://wikiapiary.com/wiki/Wikipedia_%28en%29
There is a recurring pattern: the job queue length goes up to 3-4M approximately every 7 days and then slowly comes back down to some 100k, just to go up again almost immediately. Strange. Unless there is a once-a-week power editor or background job pushing the system above its limits, this should be worth further examination if the pattern continues to occur.