User Details
- User Since
- Oct 3 2014, 8:40 AM (524 w, 4 d)
- Roles
- Administrator
- Availability
- Available
- IRC Nick
- akosiaris
- LDAP User
- Alexandros Kosiaris
- MediaWiki User
- AKosiaris (WMF)
Today
Tracked separately in T377805
Yesterday
Good question. Let me add some data points. We currently use:
To debug apparmor profiles, an easy way is to add the complain flag. So flags=(attach_disconnected) becomes flags=(complain), and the actions will be allowed and logged. That should give some pretty specific hints as to what exactly would break in the "normal" mode (i.e. without the fallbacks implemented in code to work around failures).
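For illustration, a minimal sketch of the header change (the profile path here is hypothetical, not our actual profile):

    # Enforcing mode, original flags:
    /usr/bin/example flags=(attach_disconnected) {
      # ... rules ...
    }

    # Complain mode: violations are logged (audit log / syslog) but allowed:
    /usr/bin/example flags=(complain) {
      # ... rules ...
    }

After editing, the profile can be reloaded with apparmor_parser -r <profile-file>.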
Fri, Oct 18
I've evaluated the change today.
Can you re-run these with strace so that we can figure out whether it opens /dev/null for read or write? I 99% expect read, but want to be sure. I think allowing reads on /dev/null (if we don't already) is going to be fine.
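For reference, a minimal sketch of such an strace invocation (the PID is a placeholder for the process in question):

    # Attach to the running process (following children) and trace open calls:
    strace -f -e trace=open,openat -p <PID> 2>&1 | grep /dev/null
    # O_RDONLY in the flags argument means a read-only open;
    # O_WRONLY or O_RDWR would mean it writes to /dev/null.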
For what it's worth, I also updated the dashboard at https://grafana-rw.wikimedia.org/d/U4TuF-lMk/proton?orgId=1 to allow querying both DCs at once as well as selectively, and fixed the per-container saturation panels that were broken. I also removed the nodejs garbage collection panels, as those metrics aren't being emitted anymore (support has been dropped at the service-runner level IIRC).
Thu, Oct 17
Tilerator no longer exists in the WMF environment. I'll close this as invalid, feel free to reopen.
Mistakenly removed @dcausse, re-adding.
I suggest we let the change ride next week's train and activate the rerouting on or after Oct. 28th.
Wed, Oct 16
I'll resolve this. All hosts that could be renumbered have been renumbered; 6 hosts have been decomed instead. And then there are a number of jobrunners that will hopefully soon be reimaged to wikikube-worker nodes.
Tue, Oct 15
Probably happened once more on ganeti1034 today. The node is still on bullseye, FWIW.
@HCoplin-WMF The internal mappings on rest-gateway work; all that's left is merging https://gerrit.wikimedia.org/r/c/1080232, and external traffic will move from being routed to restbase to being routed, via rest-gateway, to /w/script.php. It looks like, barring something major, we will after all be able to meet the Oct 17th date. Is there some communication or other action that you'd like to take before the change is merged and deployed? Or should we just deploy it on that date?
After discussing with @hnowlan, I think I was wrong. We already have some infrastructure (in the form of Lua scripts for ATS) that was built with Traffic to allow easier configuration of the routing and mappings of /api/rest_v1/, which I wasn't aware of. I'll untag Traffic; I don't think they are needed for this one. I've already pushed and merged a patch to configure the rest-gateway to do the mapping.
Adding content transform too.
I just had enough time to review this. This can't be implemented in the infrastructure that serviceops maintains, but rather at the CDN/Edge. Adding Traffic. I'll try to post a patch, but I'll need Traffic's OK. That said, I think implementing such logic at the CDN/Edge layer exposes too much information to it. If done, it must be temporary and reverted at a very clearly agreed date, not left lingering.
Mon, Oct 14
statsd-exporter deployment merged and deployed in both eqiad and codfw. It is addressable via the standard mechanisms that the other deployments share. General docs are at https://wikitech.wikimedia.org/wiki/Kubernetes/Metrics and https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s; more specific to this implementation is T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s (and we need to update the docs for this one once we resolve that task). I'll resolve this for now, in the interest of not letting it linger. Feel free to reopen!
The host has some history of failure per T370633: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet
Fri, Oct 11
It won't require full rebuilds indeed. But it will require restarting all containers so that they pick up the new domain name. For most, this should be a noop because, on purpose (due to discovery), we don't rely on these DNS records for most workloads. The hard part is probably figuring out what (if anything) will break across all clusters and amending proactively.
@hashar Patch merged today. Starting next week, there should be some speed improvements (roughly cutting the time in half). The unfortunate consequence is that someone using the WikimediaDebug extension during that stage of the deployment might see errors. As I point out in the task, whether that's acceptable is something to balance against the speedier deployments. The target group is rather small, so if they are made aware, it might just be worth it.
Machine powered off manually.
We've already discussed this in a 1on1 and, just for transparency's sake, this finds me in agreement. The outage we had this week carried a signal that we don't want to lose, namely that memory usage exploded over the course of a few hours, which in itself is a signal that something else was amiss (which is what T376795: mwscript-k8s creates too many resources is about). At the same time, an outage is the worst possible messenger. So finding some other way to keep the signal, like the alert pointed out above, SGTM.
Thu, Oct 10
I'll resolve this one. Things are overall OK deployment-time-wise. In fact, there is a downward trend in the following graph.
Wed, Oct 9
+1 on an upgrade regardless of the rest.
Noting that in Q3 FY24-25, that is the quarter starting in January 2025, we'll be refreshing mw[2291-2376], which includes wikikube-ctrl2001 and wikikube-ctrl2002.
Tue, Oct 8
@Pginer-WMF @santhosh, I've tried below to summarize our conversation in the P+T meeting. Please do point out mistakes and I'll correct them; I am mostly reconstructing from memory. I've also added a suggested path forward for next actions, let me know what you think.
In an ideal world, this process would inform the upper and lower bounds of an HPA and we wouldn't need to come up with exact numbers, but rather rely on some metric (e.g. PHP idle workers) and let the system balance itself. Even then, we would need to re-run the process to update those bounds every now and then, because the world (and our traffic) isn't set in stone. So automating it a bit would be worth it.
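As a rough sketch of what such an HPA could look like (the deployment name, metric name, and all numbers below are hypothetical, not taken from our actual charts):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: mw-web                     # hypothetical deployment
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: mw-web
      minReplicas: 40                  # lower bound from the capacity exercise
      maxReplicas: 120                 # upper bound from the capacity exercise
      metrics:
        - type: Pods
          pods:
            metric:
              name: php_idle_workers   # hypothetical per-pod metric
            target:
              type: AverageValue
              averageValue: "2"        # keep ~2 idle PHP workers per pod

The bounds (minReplicas/maxReplicas) would be what the periodic exercise updates, while the metric handles the day-to-day balancing.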
/etc/envoy/envoy.yaml was empty on the new host. Deleting it and running puppet fixed it, and it seems fine now.
Mon, Oct 7
We'll be tracking the decom of scandium in T376632: decommission scandium. I'll resolve this; feel free to reopen if something weird comes up with parsoidtest1001.
Mon, Sep 30
Nice job!
Fri, Sep 27
@wiki_willy wikikube-ctrl1001 is in the batch that is being decommissioned in this task. It was procured in 2019. Given the work done in T353464, where we had to somehow get hold of a 10G NIC card and reimage the node and all, I am wondering whether we can push the refresh back for a year or so and put it in the next FY budget. It feels like wasted work on all sides to find some other node and redo this. Does that sound OK?
3 years later, we no longer have appservers; this is probably best closed as invalid.
This is done
@ssastry, @ihurbain: parsoidtest1001 is up and running and should be a full replacement for scandium. I've already updated https://wikitech.wikimedia.org/w/index.php?title=Parsoid&diff=prev&oldid=2230761 to reference the new host; if you know of other places, please update them. Basic alerts and checks do work, but I may have missed something.
And we've just seen this on parsoidtest1001, which is bullseye. The old host, scandium, is on buster.
(43) mw[1352-1357,1360-1363,1367-1371,1374-1397,1399,1405,1408-1409].eqiad.wmnet are wikikube nodes (albeit not renamed) and can be depooled and decomed ASAP. They should suffice for complication 3 from above. Proceeding.
1240 1241 1242 1243 1244 1246 1249 1250 1252 1253 1254 1257 1258 1259 1260 1262 1264 1265 1270 1271 1273 1274 1278 1279 1281 1282 1284 1286 1289 1290 1294 1296 1301 1303 1304 have been pooled and are in service.
Thu, Sep 26
I'll mark this as resolved. The failures were specific to the swagger probe and were due to lacking egress network policy rules for text-lb eqiad and codfw.
Wed, Sep 25
Exclusion merged; it doesn't look like there is a need for another followup, so I think I'll close this as resolved.
Tue, Sep 24
We can probably shave off ~2 minutes for the mw-debug environment following the same approach as in T375477. I'll let the approach sit for 1 week or so and, if no issues arise, I'll apply it to mw-debug too.
For what it's worth, if scap were the only way helmfile commands would ever be run, the above would be totally OK. However, we cannot rule out that someone will run a helmfile command and apply the newer version. This could lead to an incident if the newer version were broken in some way (which is probably the case after a rollback). Also, it's a surprise waiting to happen. It's not the end of the world as a bug, but it has the potential to lead to an incident at some point.
Adding 1 more data point. In the previous deployment I also see
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Collection/+/1075157 is up; how do I get someone to review it?
This was probably to embed graphite-powered graphs in MediaWiki regarding extension download numbers. And the URL hasn't worked in quite a while anyway.
https://logstash.wikimedia.org/goto/69fa724990f8f554ac97601360675c79 points out that the slowest image pull was at 2m27s; most were way faster. The webserver ones take a few seconds, but the multiversion ones tend to be ~1m30s (the 2m27s is for the mwdebug ones, plus a couple from other deployments).
I see some evictions happening during the deployment that could explain this, trying to correlate.
This increase in deployment times coincides with the deployment to production of the following scap change: https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/447. The corresponding task is T366778.
Hmm, I've dug into Turnilo too and I fear my change isn't going to have any impact. Per https://w.wiki/BHjy, the top 5 user agents request-wise all show Chrome versions that are at least ~2 years old (a sign of bot activity), have no Referer, and come from Singaporean IPs.
Diving a bit deeper into the rabbit hole.
Mon, Sep 23
I've dug a bit into this. And it's a mess of historical reasons.
Drive-by comment to say that pediapress book printing has probably been broken for a very long time and no-one noticed. See T374888.
Sep 20 2024
Nicely done! Thanks for the detailed writeup!
@matmarex, I am pretty confident I've found the reason and have a patch at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1074365. I'd appreciate a review, mostly to avoid side effects I am not aware of.
Sep 19 2024
Sep 17 2024
Looking back in time a bit, this isn't unheard of, so it doesn't look like some recent regression.