akosiaris (Alexandros Kosiaris)
Site Reliability Engineer · Administrator

Projects (27)

User Details

User Since
Oct 3 2014, 8:40 AM (524 w, 4 d)
Roles
Administrator
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
AKosiaris (WMF) [ Global Accounts ]

Blurb

Recent Activity

Today

akosiaris triaged T377805: WikiKube: Rename the last few "production" named helm releases to use "main" instead as Medium priority.
Tue, Oct 22, 9:09 AM · serviceops, Data-Engineering, Recommendation-API, events, Event-Platform, Proton
akosiaris added a comment to T376762: Remove `.cluster.local.` suffix in PTR responses.

Tracked separately in T377805

Tue, Oct 22, 9:04 AM · Kubernetes
akosiaris added a project to T377805: WikiKube: Rename the last few "production" named helm releases to use "main" instead: serviceops.
Tue, Oct 22, 9:03 AM · serviceops, Data-Engineering, Recommendation-API, events, Event-Platform, Proton
akosiaris created T377805: WikiKube: Rename the last few "production" named helm releases to use "main" instead.
Tue, Oct 22, 9:02 AM · serviceops, Data-Engineering, Recommendation-API, events, Event-Platform, Proton

Yesterday

akosiaris added a comment to T376762: Remove `.cluster.local.` suffix in PTR responses.

Should we attempt this on a cluster or two? One of the stagings, and then perhaps aux?

Mon, Oct 21, 2:59 PM · Kubernetes
akosiaris added a comment to T375845: WikiKube clusters close to exhausting Calico IPPool allocations.

Good question. Let me add some data points. We currently use:

Mon, Oct 21, 2:53 PM · Prod-Kubernetes, netops, Infrastructure-Foundations, serviceops
akosiaris added a comment to T377468: Cannot Run Golang or Rust Binaries with Provided AppArmor Profile.

To debug AppArmor profiles, an easy way is to add the complain flag. So flags=(attach_disconnected) becomes flags=(complain), and the actions will be allowed and logged. That should give some pretty specific hints as to what exactly would break in the "normal" mode (that is, not the fallbacks implemented in code to work around failures).

Mon, Oct 21, 10:01 AM · serviceops
akosiaris added a comment to T357950: Remove servicerunner dependency for cxserver.

@akosiaris We deployed this code in staging. Only issue we observe is ECS logging is not parsed by logstash.

the msg field has values like 2024-10-16T07:19:25.484793362Z stdout F {"@timestamp":"2024-10-16T07:19:25.484Z","ecs.version":"8.10.0","log.level":"info","message":"GET /_info 200"}

Example: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.10.16?id=E9I2lJIB-wz6FOsqP_-w
(please ignore the info level log, we changed the log level after this)

The application prints logs to stdout as json lines
{"@timestamp":"2024-10-16T07:19:25.484Z","ecs.version":"8.10.0","log.level":"info","message":"GET /_info 200"}

Do you know why this parsing failed?
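
The msg value quoted above looks like a raw container-runtime log line: a timestamp, the stream name, a one-letter tag, and only then the JSON that the application printed. The sketch below is not a diagnosis of the logstash behaviour; it only illustrates, under that assumption about the line layout, that the prefix has to be split off before the payload can be decoded as JSON. The function name `parse_cri_line` and the returned keys are hypothetical.

```python
import json

def parse_cri_line(line: str) -> dict:
    """Split a CRI-style container log line and decode its JSON payload."""
    # Assumed layout: "<timestamp> <stream> <P|F> <content>"
    timestamp, stream, tag, content = line.split(" ", 3)
    return {
        "cri_timestamp": timestamp,
        "stream": stream,
        "partial": tag == "P",  # "F" marks a full line, "P" a partial one
        "payload": json.loads(content),
    }

# The sample is the msg value quoted in the comment above.
sample = (
    '2024-10-16T07:19:25.484793362Z stdout F '
    '{"@timestamp":"2024-10-16T07:19:25.484Z","ecs.version":"8.10.0",'
    '"log.level":"info","message":"GET /_info 200"}'
)
parsed = parse_cri_line(sample)
print(parsed["payload"]["message"])
```

Calling `json.loads` on the whole line would fail, because the line starts with the timestamp rather than with `{`.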

Mon, Oct 21, 9:40 AM · Patch-For-Review, LPL Essential (LPL Essential 2024 Jul-Sep), CX-cxserver, Technical-Debt

Fri, Oct 18

akosiaris added a project to T376795: mwscript-k8s creates too many resources: SRE-OnFire.
Fri, Oct 18, 2:39 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
akosiaris added a comment to T374907: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change.

I've evaluated the change today

Fri, Oct 18, 2:00 PM · serviceops, Release-Engineering-Team, Deployments
akosiaris added a comment to T377468: Cannot Run Golang or Rust Binaries with Provided AppArmor Profile.

Can you re-run these with strace so that we can figure out whether it opens /dev/null for read or write? I 99% expect read, but wanna be sure. I think allowing reads on /dev/null (if we don't already) is going to be fine.

Fri, Oct 18, 12:56 PM · serviceops
akosiaris added a comment to T376438: Download to PDF: HTTP 500 error on some wikis for some users.

For what it's worth, I also updated the dashboard at https://grafana-rw.wikimedia.org/d/U4TuF-lMk/proton?orgId=1 to allow querying both DCs at once as well as selectively, and fixed the per-container saturation panels that were broken. I also removed the nodejs garbage collection panels, as those metrics aren't being emitted anymore (support has been dropped at the service-runner level, IIRC).

Fri, Oct 18, 9:49 AM · Essential-Work, Content-Transform-Team, serviceops, Electron-PDFs

Thu, Oct 17

akosiaris added a comment to T349118: Migrate node-based services in production to node18.
Thu, Oct 17, 1:55 PM · Content-Transform-Team, Platform Engineering, Trust and Safety Product Team (Engineering), Patch-For-Review, Essential-Work, Page Content Service, MediaWiki-Engineering, [DEPRECATED] wdwb-tech, Wikidata, Citoid, Wikidata-Termbox, Wikimedia-Portals, Data-Engineering, serviceops
akosiaris closed T109776: Tilerator should purge Varnish cache as Invalid.

Tilerator no longer exists in the WMF environment. I'll close this as invalid; feel free to reopen.

Thu, Oct 17, 1:45 PM · Traffic-Icebox, Maps, SRE, Varnish
akosiaris closed T109776: Tilerator should purge Varnish cache, a subtask of T137616: Epic: cultivating the Maps garden, as Invalid.
Thu, Oct 17, 1:44 PM · Maps (Maps-data), SRE, Epic
akosiaris updated subscribers of T363214: kafka-main100[6789] and kafka-main1010 implementation tracking.

Mistakenly removed @dcausse, re-adding.

Thu, Oct 17, 12:57 PM · serviceops
akosiaris updated subscribers of T363214: kafka-main100[6789] and kafka-main1010 implementation tracking.
Thu, Oct 17, 12:55 PM · serviceops
akosiaris added a comment to T374683: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints.

I suggest we let the change ride next week's train and activate the rerouting on or after Oct. 28th

Thu, Oct 17, 7:38 AM · Patch-For-Review, serviceops, RESTBase Sunsetting, MW-Interfaces-Team

Wed, Oct 16

akosiaris added a comment to T359820: Developer Account Blocking: Migrate the one-stop Developer (un)Blocking from Wikitech to Bitu.

@taavi I just added you to the list of people who can block users. You should have a "Block/unblock accounts" entry in the menu now.

What about anyone else? Please document a process for getting the necessary access.

Wed, Oct 16, 2:53 PM · collaboration-services, Infrastructure-Foundations, Bitu
akosiaris closed T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets, a subtask of T354869: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets, as Resolved.
Wed, Oct 16, 2:32 PM · netops, SRE, Infrastructure-Foundations
akosiaris closed T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets, a subtask of T374366: Race condition in iptables rules during puppet runs on k8s nodes, as Resolved.
Wed, Oct 16, 2:32 PM · Infrastructure-Foundations, Kubernetes, Prod-Kubernetes, serviceops
akosiaris closed T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets as Resolved.

I'll resolve this. All hosts that could be renumbered have been renumbered. 6 hosts have been decomed instead. And then there are a number of jobrunners that will hopefully soon be reimaged to wikikube-worker nodes.

Wed, Oct 16, 2:32 PM · serviceops

Tue, Oct 15

akosiaris updated the task description for T374683: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints.
Tue, Oct 15, 3:51 PM · Patch-For-Review, serviceops, RESTBase Sunsetting, MW-Interfaces-Team
akosiaris added a comment to T374888: Log messages at ERROR level on http channel: Special:Book unable to connect to https://tools.pediapress.com.
Tue, Oct 15, 3:44 PM · Patch-For-Review, Wikimedia-production-error, Collection
akosiaris added a comment to T348730: DRBD kernel error on ganeti2031 led to kernel hang.

Probably happened once more on ganeti1034 today. node is still bullseye fwiw.

Tue, Oct 15, 3:32 PM · Infrastructure-Foundations, SRE
akosiaris added a comment to T374683: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints.

@HCoplin-WMF The internal mappings on rest-gateway work; all that's left is merging https://gerrit.wikimedia.org/r/c/1080232, and external traffic will move from being routed to restbase to being routed, via rest-gateway, to /w/script.php. It looks like, barring something major, we will after all be able to meet the Oct 17th date. Is there some communication or other action that you'd like to take before the change is merged and deployed? Or should we just deploy it on that date?

Tue, Oct 15, 2:57 PM · Patch-For-Review, serviceops, RESTBase Sunsetting, MW-Interfaces-Team
akosiaris removed a project from T374683: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints: Traffic.

After discussing with @hnowlan, I think I was wrong. We already have some infrastructure (in the form of Lua scripts for ATS), built with Traffic to allow easier configuration of the routing and mappings of /api/rest_v1/, that I wasn't aware of. I'll untag Traffic; I don't think they are needed for this one. I've already pushed and merged a patch to configure the rest-gateway to do the mapping.

Tue, Oct 15, 2:30 PM · Patch-For-Review, serviceops, RESTBase Sunsetting, MW-Interfaces-Team
akosiaris added a project to T376438: Download to PDF: HTTP 500 error on some wikis for some users: Content-Transform-Team.

Adding content transform too.

Tue, Oct 15, 9:30 AM · Essential-Work, Content-Transform-Team, serviceops, Electron-PDFs
akosiaris edited projects for T374683: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints, added: Traffic, serviceops; removed serviceops-radar.

I just had enough time to review this. This can't be implemented in the infrastructure that serviceops maintains, but rather at the CDN/Edge. Adding Traffic. I'll also try to post a patch, but I'll need Traffic's OK. That said, I think that implementing such logic at the CDN/Edge layer exposes too much information to it. If done, it must be temporary and reverted at a very clearly agreed date, not left lingering.

Tue, Oct 15, 8:52 AM · Patch-For-Review, serviceops, RESTBase Sunsetting, MW-Interfaces-Team

Mon, Oct 14

akosiaris closed T376714: Evaluate running a statsd-exporter in the mw-script namespace as Resolved.

statsd-exporter deployment merged and deployed in both eqiad and codfw. It is addressable via the standard mechanisms that the other deployments share. General docs are at https://wikitech.wikimedia.org/wiki/Kubernetes/Metrics and https://wikitech.wikimedia.org/wiki/Prometheus/statsd_k8s; more specific to this implementation is T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s (and we need to update docs for this one once we resolve that task). I'll resolve this for now, in the interest of not letting it linger. Feel free to reopen!

Mon, Oct 14, 10:57 AM · Patch-For-Review, MW-on-K8s, serviceops
akosiaris closed T376714: Evaluate running a statsd-exporter in the mw-script namespace, a subtask of T341553: Allow running one-off scripts manually, as Resolved.
Mon, Oct 14, 10:56 AM · Patch-For-Review, MW-on-K8s, serviceops
akosiaris closed T376961: host rdb1014 is down as Resolved.

The host has some history of failure per T370633: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet

Mon, Oct 14, 10:17 AM · serviceops, SRE

Fri, Oct 11

akosiaris added a comment to T376762: Remove `.cluster.local.` suffix in PTR responses.

It won't require full rebuilds indeed. But it will require restarting all containers so that they pick up the new domain name. For most, that should be a no-op because, on purpose (due to discovery), we don't rely on these DNS records for most workloads. The hard part is probably figuring out what (if anything) will break across all clusters and amending it proactively.

Fri, Oct 11, 2:53 PM · Kubernetes
akosiaris added a comment to T374907: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change.

@hashar Patch merged today. Starting next week, there should be some speed improvements (roughly cutting down the time in half). The unfortunate consequence is that someone using WikimediaDebug extension during that stage of the deployment might see errors. As I point out in the task, whether that's going to be acceptable is something to balance against the speedier deployments. Target group is rather small, so if they are made aware, it might just be worth it.

Fri, Oct 11, 1:50 PM · serviceops, Release-Engineering-Team, Deployments
akosiaris updated subscribers of T376632: decommission scandium.

@ssastry, @Arlolra, fyi scandium is no more, may it RIP.

Fri, Oct 11, 12:45 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware
akosiaris edited projects for T376632: decommission scandium, added: ops-eqiad; removed serviceops.
Fri, Oct 11, 12:36 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware
akosiaris updated the task description for T376632: decommission scandium.
Fri, Oct 11, 12:35 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware
akosiaris added a comment to T376632: decommission scandium.

machine powered off manually

Fri, Oct 11, 12:35 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware
akosiaris added a comment to T376976: Remove memory limits from critical cluster components (calico).

We've already discussed this in a 1on1 and, just for transparency's sake, this finds me in agreement. The outage we had this week carried a signal that we don't want to lose: memory usage exploded over the course of a few hours, which in itself is a sign that something else was amiss (which is what T376795: mwscript-k8s creates too many resources is about). At the same time, an outage is the worst possible messenger. So finding some other way to keep the signal, like the alert pointed out above, SGTM.

Fri, Oct 11, 7:15 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops

Thu, Oct 10

akosiaris closed T366778: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts as Resolved.

I'll resolve this one. Things overall are OK deployment-time-wise. In fact, there is a downward trend in the following graph

Thu, Oct 10, 12:49 PM · MW-on-K8s, Scap, serviceops

Wed, Oct 9

akosiaris updated subscribers of T376795: mwscript-k8s creates too many resources.
Wed, Oct 9, 2:07 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
akosiaris added a comment to T371130: MoveComms support for Southward Datacenter Switchover (September 2024).

The reasons you give make a lot of sense. I'll document our process so that the person in charge books the time needed to change local times in the different messages, if not done by a volunteer first.

Using a bot would be nice, but we don't have the skills within the team to have one running. :)

Wed, Oct 9, 2:00 PM · MoveComms-Support, Datacenter-Switchover, serviceops
akosiaris added a comment to T376762: Remove `.cluster.local.` suffix in PTR responses.

+1 on an upgrade regardless of the rest.

Wed, Oct 9, 1:30 PM · Kubernetes
akosiaris added a comment to T371130: MoveComms support for Southward Datacenter Switchover (September 2024).

The retro is a bit late, sorry.

Wed, Oct 9, 1:10 PM · MoveComms-Support, Datacenter-Switchover, serviceops
akosiaris updated the task description for T353464: Migrate wikikube control planes to hardware nodes.
Wed, Oct 9, 11:38 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
akosiaris added a comment to T353464: Migrate wikikube control planes to hardware nodes.

Noting that in Q3 FY24-25, that is the quarter starting in January 2025, we'll be refreshing mw[2291-2376], which includes wikikube-ctrl2001 and wikikube-ctrl2001

Wed, Oct 9, 11:37 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

Tue, Oct 8

akosiaris created T376714: Evaluate running a statsd-exporter in the mw-script namespace.
Tue, Oct 8, 1:08 PM · Patch-For-Review, MW-on-K8s, serviceops
akosiaris added a comment to T370755: Caching service request for MinT.

@Pginer-WMF @santhosh, I've tried below to summarize our conversation in the P+T meeting. Please do point out mistakes and I'll correct them; I am mostly reconstructing from memory. I've also added a suggested path forward for next actions. Let me know what you think.

Tue, Oct 8, 12:47 PM · Community-Tech (Jackal (not a fox) Fox), serviceops, LPL Essential, Community Wishlist (Translations), MinT
akosiaris added a comment to T376519: Steady-state sizing of mw-web and mw-api-ext.

In an ideal world, this process would inform the upper and lower bounds of an HPA and we wouldn't need to come up with exact numbers, but rather rely on some metric (e.g. PHP idle workers) and let the system balance itself. And still we would need to run the process to update those bounds every now and then, because the world (and our traffic) isn't set in stone. So automating it a bit would be worth it.
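
As a rough illustration of why only the bounds would need maintaining, here is a minimal sketch of the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, clamped to a minimum and maximum. The metric (percentage of busy workers), the replica counts, and the bounds are hypothetical numbers, not values from this task, and a real HPA also applies a tolerance and stabilization behaviour that this sketch leaves out.

```python
import math

def desired_replicas(current_replicas, metric_value, target_value,
                     min_replicas, max_replicas):
    """Return the replica count an HPA-style controller would aim for."""
    # Core rule: desired = ceil(current * currentMetric / targetMetric)
    raw = math.ceil(current_replicas * metric_value / target_value)
    # The bounds are what a periodic sizing process would provide.
    return max(min_replicas, min(max_replicas, raw))

# Hypothetical busy period: 100 replicas, 90% busy workers, 60% target.
scaled_up = desired_replicas(100, 90, 60, min_replicas=80, max_replicas=120)
# Hypothetical quiet period: 30% busy workers against the same target.
scaled_down = desired_replicas(100, 30, 60, min_replicas=80, max_replicas=120)
print(scaled_up, scaled_down)
```

In both hypothetical cases the unclamped result (150 and 50) falls outside the bounds, so the bounds decide the outcome. That is why stale bounds would matter even with a self-balancing system.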

Tue, Oct 8, 7:41 AM · Patch-For-Review, Datacenter-Switchover, serviceops
akosiaris closed T363402: parsoidtest1001 implementation tracking as Resolved.

/etc/envoy/envoy.yaml was empty on the new host. Deleting it and running puppet fixed it, and it seems fine now

Tue, Oct 8, 7:12 AM · Parsoid (Tracking), Patch-For-Review, serviceops
akosiaris closed T363402: parsoidtest1001 implementation tracking, a subtask of T363399: Q4:rack/setup/install parsoidtest1001, as Resolved.
Tue, Oct 8, 7:09 AM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops

Mon, Oct 7

akosiaris closed T363402: parsoidtest1001 implementation tracking as Resolved.

We'll be tracking decom of scandium in T376632: decommission scandium. I'll resolve this; feel free to reopen if something weird comes up with parsoidtest1001.

Mon, Oct 7, 3:49 PM · Parsoid (Tracking), Patch-For-Review, serviceops
akosiaris closed T363402: parsoidtest1001 implementation tracking, a subtask of T363399: Q4:rack/setup/install parsoidtest1001, as Resolved.
Mon, Oct 7, 3:47 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
akosiaris created T376632: decommission scandium.
Mon, Oct 7, 3:43 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware

Mon, Sep 30

akosiaris added a comment to T375645: Clean up the Docker Registry catalog and Swift storage from old images.

Nice job!

Mon, Sep 30, 1:37 PM · User-Elukey, Infrastructure-Foundations, SRE, serviceops

Fri, Sep 27

akosiaris updated subscribers of T375842: decommission mw[1349-1413].

@wiki_willy wikikube-ctrl1001 is in the batch that is being decommissioned in this task. It was procured in 2019. Given the work done in T353464 where we had to somehow get hold of a 10G nic card and reimage the node and all, I am wondering whether we can push refresh back for like 1 year or so and put it in the next fy budget. It feels like wasted work on all sides to find some other node and redo this. Does that sound ok?

Fri, Sep 27, 2:26 PM · decommission-hardware
akosiaris updated the task description for T375842: decommission mw[1349-1413].
Fri, Sep 27, 2:13 PM · decommission-hardware
akosiaris closed T285593: php7.2-fpm_check_restart should be resilient to php7adm error pages as Invalid.

3 years later, we no longer have appservers, this is probably best closed as invalid

Fri, Sep 27, 1:41 PM · serviceops, SRE
akosiaris updated the task description for T375842: decommission mw[1349-1413].
Fri, Sep 27, 1:25 PM · decommission-hardware
akosiaris closed T369744: wikikube-worker1240 to wikikube-worker1304 implementation tracking, a subtask of T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304, as Resolved.
Fri, Sep 27, 12:14 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
akosiaris closed T369744: wikikube-worker1240 to wikikube-worker1304 implementation tracking as Resolved.

This is done

Fri, Sep 27, 12:14 PM · serviceops
akosiaris added a comment to T374683: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints.

Just want to update that we should be good to go on this by Monday. The final changes to support the compatibility layers are currently riding the train.

@akosiaris -- can we plan to take action on this on Monday from the ServiceOps side? Is there anything else y'all need beforehand?

Fri, Sep 27, 11:21 AM · Patch-For-Review, serviceops, RESTBase Sunsetting, MW-Interfaces-Team
akosiaris added a comment to T375845: WikiKube clusters close to exhausting Calico IPPool allocations.

Note that per past experience, changing the ippool is an arduous and dangerous process. We probably don't need that though and can live without aggregating on the configuration level the 2 /18s to a /17. On the BGP level, it's /26s anyway that get announced.

It indeed is, at least when changing the ip block size T345823: Wikikube staging clusters are out of IPv4 Pod IP's. It might be possible though to migrate to a new IPPool that contains the old one. We might as well aggregate the two pools into one when recreating the cluster for T341984: Update Kubernetes clusters to >1.25.

But we need to keep in mind that the cluster CIDR (pod IP range) is configured in multiple places in hiera (at least) and it needs to be configured in kube-proxy as well - where only one CIDR per stack (IPv4/IPv6) is supported.
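
The prefix arithmetic in this exchange can be checked with Python's standard `ipaddress` module. This is only a sketch: the two /18 networks below are hypothetical placeholders (the real pool addresses are not given here), and two /18s collapse into a single /17 only when they are adjacent and aligned, as these placeholders are.

```python
import ipaddress

# Hypothetical adjacent /18 pools, NOT the real WikiKube IPPools.
pool_a = ipaddress.ip_network("10.0.0.0/18")
pool_b = ipaddress.ip_network("10.0.64.0/18")

# Aggregating the two pools into the smallest covering set of networks.
aggregated = list(ipaddress.collapse_addresses([pool_a, pool_b]))
print(aggregated)

# Number of /26 blocks (the size announced over BGP) inside one /18 pool.
blocks_per_pool = len(list(pool_a.subnets(new_prefix=26)))
addresses_per_block = pool_a.num_addresses // blocks_per_pool
print(blocks_per_pool, addresses_per_block)
```

With these placeholders, each /18 holds 256 blocks of 64 addresses, and the two pools together cover exactly one /17.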

Fri, Sep 27, 11:10 AM · Prod-Kubernetes, netops, Infrastructure-Foundations, serviceops
akosiaris triaged T363402: parsoidtest1001 implementation tracking as Medium priority.
Fri, Sep 27, 11:06 AM · Parsoid (Tracking), Patch-For-Review, serviceops
akosiaris updated subscribers of T363402: parsoidtest1001 implementation tracking.

@ssastry, @ihurbain parsoidtest1001 is up and running and should be a full replacement of scandium. I've already updated https://wikitech.wikimedia.org/w/index.php?title=Parsoid&diff=prev&oldid=2230761 to reference the new hosts; if you know of other places, please update them. Basic alerts and checking do work, but I may have missed something.

Fri, Sep 27, 11:01 AM · Parsoid (Tracking), Patch-For-Review, serviceops
akosiaris added a comment to T256098: Segfault for systemd-sysusers.service on stat1007.

And we've just seen this on parsoidtest1001, which is bullseye. The old host, scandium, is on buster.

Fri, Sep 27, 7:59 AM · Infrastructure-Foundations, SRE
akosiaris created T375845: WikiKube clusters close to exhausting Calico IPPool allocations.
Fri, Sep 27, 7:53 AM · Prod-Kubernetes, netops, Infrastructure-Foundations, serviceops
akosiaris added a comment to T354169: Evaluate usage of Kubernetes/Wikikube Tags in netbox and replace them with something if possible.

Upgrade is done, I had a bit more time to look into that.

Taking this list as baseline: https://netbox.wikimedia.org/ipam/prefixes/?q=kubernetes (all the prefixes that mention "kubernetes" in their description).
We can see that there is already a drift between the tagged prefixes: some (like 2620:0:861:301::/116) are missing the k8s tag.

To my understanding, it would be functionally the same to replace the usage of the k8s tag with a Kubernetes "Prefix role" (https://netbox.wikimedia.org/ipam/roles/): same kind of filtering, etc.
However, going to a finer granularity (for example having a wikikube device role) seems overkill as it risks creating too many roles, but that's more a feeling than an absolute truth :)
Roles also don't support nesting, so we can't define a potential wikikube role as a kubernetes one.

I also suggest that we clean up the current descriptions, as the data they contain is redundant with the various prefix attributes.
For example, taking some random prefixes from that list:

  • Kubernetes IP spaces (eqiad) -> (no description), kubernetes role, keep the current site and status attributes
  • kubernetes service IPs reservation (eqiad) -> Description: "WikiKube service", kubernetes role, keep the current site and status attributes
  • kubernetes pod IPs staging (eqiad) -> Description: "WikiKube staging pod", kubernetes role, keep the current site and status attributes
  • ML/DE Team Kubernetes IP spaces - ml-serve service IPs (eqiad) -> Description: "ml-serve service", kubernetes role, keep the current site and status attributes

That way the descriptions are easier to read, filtering works more efficiently, we can decom the tags, and the use case from the task description is taken care of.

Fri, Sep 27, 7:27 AM · Infrastructure-Foundations, netbox
akosiaris added a comment to T375842: decommission mw[1349-1413].

(43) mw[1352-1357,1360-1363,1367-1371,1374-1397,1399,1405,1408-1409].eqiad.wmnet are wikikube nodes (albeit not renamed) and can be depooled and decomed ASAP. They should also suffice for complication 3 from above. Proceeding.
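
The host count quoted above can be double-checked by expanding the bracketed range expression. The helper below is a hypothetical, minimal parser written for this illustration; it is not the parser used by any actual fleet tooling, handles only a single bracket group, and does not preserve zero padding.

```python
import re

def expand_hosts(expr: str) -> list[str]:
    """Expand 'prefix[a-b,c,...]suffix' into individual host names."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        # No bracket group: treat the expression as a single host name.
        return [expr]
    prefix, ranges, suffix = match.groups()
    hosts = []
    for part in ranges.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a range of one
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number}{suffix}")
    return hosts

hosts = expand_hosts(
    "mw[1352-1357,1360-1363,1367-1371,1374-1397,1399,1405,1408-1409].eqiad.wmnet"
)
print(len(hosts))
```

The expansion yields 43 names, from mw1352.eqiad.wmnet to mw1409.eqiad.wmnet, which matches the count given in the comment.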

Fri, Sep 27, 7:07 AM · decommission-hardware
akosiaris added a comment to T369744: wikikube-worker1240 to wikikube-worker1304 implementation tracking.

1240 1241 1242 1243 1244 1246 1249 1250 1252 1253 1254 1257 1258 1259 1260 1262 1264 1265 1270 1271 1273 1274 1278 1279 1281 1282 1284 1286 1289 1290 1294 1296 1301 1303 1304 have been pooled and are in service.

Fri, Sep 27, 7:03 AM · serviceops
akosiaris created T375842: decommission mw[1349-1413].
Fri, Sep 27, 7:01 AM · decommission-hardware

Thu, Sep 26

akosiaris added a comment to T330768: citoid having stability issues.

Whilst poking around open alerts, I noticed that both citoid and zotero have a few active alerts about "Container …-tls-proxy is consistently using [>95]% of its memory limit"; is this expected? Other TLS proxies for prod services seem to be using much less (and even the ones not alerting for citoid/zotero seem to be using more than average). Is it just an artefact of making many outbound requests?

Thu, Sep 26, 1:25 PM · Platform Team Workboards (Platform Engineering Reliability), serviceops, Citoid
akosiaris closed T361728: SwaggerProbeHasFailures for citoid (due to Zotero failures) after upgrading to node 18 as Resolved.

I'll mark this as resolved. The failures were specific to the swagger probe and were due to lacking egress network policy rules for text-lb eqiad and codfw.

Thu, Sep 26, 1:18 PM · serviceops-radar, Citoid
akosiaris closed T361728: SwaggerProbeHasFailures for citoid (due to Zotero failures) after upgrading to node 18, a subtask of T349118: Migrate node-based services in production to node18, as Resolved.
Thu, Sep 26, 1:18 PM · Content-Transform-Team, Platform Engineering, Trust and Safety Product Team (Engineering), Patch-For-Review, Essential-Work, Page Content Service, MediaWiki-Engineering, [DEPRECATED] wdwb-tech, Wikidata, Citoid, Wikidata-Termbox, Wikimedia-Portals, Data-Engineering, serviceops
akosiaris added a comment to T370755: Caching service request for MinT.

@santhosh given the low traffic and the low storage needs, we could start off by adding memcached pods in translation service's namespace, and start from there. If we find that resource wise this is not enough, we re-iterate. What do you think?

For context, regarding the "low traffic" observation, it is worth noting that we had to disable a feature that exposed MinT machine translation to Wikipedia readers on 23 wikis because the number of requests were exceeding the server capacity. More details in T363338#10038113 and in this report.

Thu, Sep 26, 12:09 PM · Community-Tech (Jackal (not a fox) Fox), serviceops, LPL Essential, Community Wishlist (Translations), MinT

Wed, Sep 25

akosiaris closed T375544: Investigate stale dashboards after logstash.discovery.wmnet switch to codfw as Resolved.

Exclusion merged. It doesn't look like there is a need for another followup, so I'll close this as resolved.

Wed, Sep 25, 2:19 PM · Observability-Logging
akosiaris updated the task description for T349118: Migrate node-based services in production to node18.
Wed, Sep 25, 11:20 AM · Content-Transform-Team, Platform Engineering, Trust and Safety Product Team (Engineering), Patch-For-Review, Essential-Work, Page Content Service, MediaWiki-Engineering, [DEPRECATED] wdwb-tech, Wikidata, Citoid, Wikidata-Termbox, Wikimedia-Portals, Data-Engineering, serviceops
akosiaris added a comment to T374888: Log messages at ERROR level on http channel: Special:Book unable to connect to https://tools.pediapress.com.
Wed, Sep 25, 7:30 AM · Patch-For-Review, Wikimedia-production-error, Collection

Tue, Sep 24

akosiaris claimed T374907: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change.

We can probably shave off ~2 minutes for the mw-debug environment by following the same approach as in T375477. I'll let the approach sit for 1 week or so and, if no issues arise, I'll apply it to mw-debug too.

Tue, Sep 24, 1:18 PM · serviceops, Release-Engineering-Team, Deployments
akosiaris updated the task description for T366778: Evaluate the performance improvements brought in by prefetching MW images on WikiKube hosts.
Tue, Sep 24, 1:09 PM · MW-on-K8s, Scap, serviceops
akosiaris added a comment to T375497: scap leaves mediawiki main releases in the "forward" version after a canary rollback.

For what it's worth, if scap were the only way helmfile commands would ever be run, the above would be totally OK. However, we cannot rule out that someone will run a helmfile command and apply the newer version. This could lead to an incident if the newer version was broken in some way (which is probably the case after a rollback). Also, it's a surprise waiting to happen. It's not the end of the world as a bug, but has the potential to lead to an incident at some point.

Tue, Sep 24, 12:36 PM · Scap
akosiaris created T375497: scap leaves mediawiki main releases in the "forward" version after a canary rollback.
Tue, Sep 24, 12:34 PM · Scap
akosiaris added a comment to T375477: Helm deployment timeouts during train presync.

Adding 1 more data point. In the previous deployment I see also

Tue, Sep 24, 12:22 PM · MW-on-K8s, serviceops, Scap
akosiaris added a comment to T374888: Log messages at ERROR level on http channel: Special:Book unable to connect to https://tools.pediapress.com.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Collection/+/1075157 is up, how do I get anyone to review it?

Tue, Sep 24, 12:09 PM · Patch-For-Review, Wikimedia-production-error, Collection
akosiaris added a comment to T375477: Helm deployment timeouts during train presync.
Tue, Sep 24, 10:34 AM · MW-on-K8s, serviceops, Scap
akosiaris added a comment to T374887: Log messages at ERROR level on http channel: Special:ExtensionDistributor unable to connect to https://graphite.wikimedia.org.

This probably was to embed graphite powered graphs in MediaWiki regarding extension download numbers. And the URL hasn't worked in quite a while anyway.

Tue, Sep 24, 10:28 AM · Unstewarded-production-error, Observability-Metrics, Patch-For-Review, Wikimedia-production-error, Grafana, ExtensionDistributor
akosiaris added a comment to T375477: Helm deployment timeouts during train presync.

https://logstash.wikimedia.org/goto/69fa724990f8f554ac97601360675c79 points out that the slowest image pull was at 2m27s; most were way faster. Webserver ones took a few seconds, but multiversion ones tend to be ~1m 30s (the 2m27s is for the mwdebug ones plus a couple from other deployments).

Tue, Sep 24, 10:07 AM · MW-on-K8s, serviceops, Scap
akosiaris added a comment to T375477: Helm deployment timeouts during train presync.

I see some evictions happening during the deployment that could explain this, trying to correlate.

Tue, Sep 24, 9:51 AM · MW-on-K8s, serviceops, Scap
akosiaris added a comment to T375477: Helm deployment timeouts during train presync.

This increase in deployment times coincides with the deployment to production of the following scap change: https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/447. The corresponding task is T366778.

Tue, Sep 24, 9:43 AM · MW-on-K8s, serviceops, Scap
akosiaris added a comment to T374888: Log messages at ERROR level on http channel: Special:Book unable to connect to https://tools.pediapress.com.

Hmm, I've dug into Turnilo too and I fear my change isn't going to have any impact. Per https://w.wiki/BHjy, the top 5 user agents by request count all show Chrome versions that are at least ~2 years old (a sign of bot activity), come without a Referer, and originate from Singaporean IPs

Tue, Sep 24, 8:18 AM · Patch-For-Review, Wikimedia-production-error, Collection
akosiaris added a comment to T374888: Log messages at ERROR level on http channel: Special:Book unable to connect to https://tools.pediapress.com.

Diving a bit deeper into the rabbit hole.

Tue, Sep 24, 7:53 AM · Patch-For-Review, Wikimedia-production-error, Collection

Mon, Sep 23

akosiaris added a comment to T374888: Log messages at ERROR level on http channel: Special:Book unable to connect to https://tools.pediapress.com.

I've dug a bit into this, and it's a mess of historical reasons.

Mon, Sep 23, 2:47 PM · Patch-For-Review, Wikimedia-production-error, Collection
akosiaris added a comment to T224922: Code Stewardship Review: Collection Extension.

Drive-by comment to say that pediapress book printing has probably been broken for a very long time and no one noticed. See T374888.

Mon, Sep 23, 2:26 PM · MW-1.43-notes (1.43.0-wmf.25; 2024-10-01), Collection, Code-Stewardship-Reviews

Sep 20 2024

akosiaris added a comment to T371273: Verify our current wikikube capacity (in both DCs) can handle all our traffic.

Nicely done! Thanks for the detailed writeup!

Sep 20 2024, 9:46 AM · Datacenter-Switchover, serviceops
akosiaris claimed T374997: Some sites try and fail to serve favicon.ico.

@matmarex, I am pretty confident I've found the reason and have a patch at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1074365. I'd appreciate a review, mostly to avoid side effects I am not aware of.

Sep 20 2024, 8:21 AM · Patch-For-Review, serviceops, MW-on-K8s, Traffic

Sep 19 2024

akosiaris added a comment to T374907: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change.

For the OCI image being pulled, scap has a step described as "docker pull on k8s nodes", which happens before the sync-testservers step. I can't tell whether it was fast this morning, but I can surely look it up in the scap logs.

That specific step is being re-evaluated in T366778 and unless something changes dramatically, it will soon be no more. Don't spend cycles on it.

Sep 19 2024, 3:28 PM · serviceops, Release-Engineering-Team, Deployments
akosiaris moved T374997: Some sites try and fail to serve favicon.ico from Incoming 🐫 to Doing 😎 on the serviceops board.
Sep 19 2024, 3:13 PM · Patch-For-Review, serviceops, MW-on-K8s, Traffic
akosiaris added a comment to T375201: wikikube-worker1001 failed to docker pull on two consecutive deployments.

Thanks for fixing it @JMeybohm. For posterity's sake, the "fooing" part was related to T374366 and trying to figure out the race condition(s).

Sep 19 2024, 2:36 PM · serviceops

Sep 17 2024

akosiaris added a comment to T374907: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change.
Sep 17 2024, 3:34 PM · serviceops, Release-Engineering-Team, Deployments
akosiaris added a comment to T374907: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change.

Looking back in time a bit, this isn't unheard of, so it doesn't look like a recent regression.

Sep 17 2024, 12:16 PM · serviceops, Release-Engineering-Team, Deployments