Conversation

@UDtorrey
Collaborator

@UDtorrey UDtorrey commented Jan 13, 2026

This PR supersedes #27.

This pull request adds a comprehensive set of architecture diagrams and supporting documentation for the GKE deployment, focusing on both L4 (EPP) and L7 (Gateway API) traffic paths, Kubernetes and GCP resources, and their interdependencies. It also introduces a .cursorrules file outlining repository contribution rules. The new diagrams use Mermaid syntax and a consistent color scheme for clarity.

The most important changes are:

GKE Architecture Documentation and Diagrams:

  • Added a high-level GKE architecture diagram (gke-arch.mermaid) that visualizes the flow of traffic from the internet through the Gateway API, HTTPRoutes, Services, and Pods, including EPP and proxy paths.
  • Introduced a complete architecture diagram (gke-complete-architecture.mermaid) detailing the relationships between GCP resources (IPs, forwarding rules, backend services), Kubernetes resources (Gateways, HTTPRoutes, ServiceImports, Deployments, Pods), and supporting policies.
  • Added resource-specific diagrams for L4 EPP GCP resources (gke-l4-epp-gcp-resources.mermaid), L4 EPP Kubernetes resources (gke-l4-epp-k8s-resources.mermaid), L7 Gateway GCP resources (gke-l7-gateway-gcp-resources.mermaid), and L7 Gateway Kubernetes resources (gke-l7-gateway-k8s-resources.mermaid). [1] [2] [3] [4]
  • Added a resource dependency diagram (gke-resource-dependencies.mermaid) showing how GCP and Kubernetes resources depend on each other throughout the stack.

Repository Contribution Rules:

  • Introduced a .cursorrules file specifying guidelines for handling AI-generated content and commit messages.

weiminyu and others added 30 commits May 8, 2025 17:29
We plan on using this for EPP password resets and registry lock password
resets for now.
This avoids potential replication lag issues when requesting info on
domains that were just created.
* Remove registrar id from invoice grouping key

* Fix formatting issues

* Update BillingEventTests
This is just so that we can add an additional layer of security on
verification
If contacts are optional, they should be optional in the command too.
The Cloud DNS rest api is now case-sensitive about enum names (must be
lower case, counterintuitively).
We can probably improve on this in the future if we want, but there's a
lot of boilerplate that we don't need to repeat over and over
This picks up a few changes including aligning the placement of quotes
in text blocks with the Google style guide.
Add a new DnsWritersModule for use by the component classes.

To override the set of writers installed, we can easily overwrite this
file with a private version.
This implements two types of changes:
1. changing the link type for things like the terms of service
2. adding the request URL to each and every link with the "value" field.
   This is a bit tricky to implement because the links are generated in
various places, but we can implement it by adding it to the results
after generation.

See b/418782147 for more information
It was making the undo nomulus command look like this:

)nomulus ...
A future PR will add the actions that save and use this object. That
future PR will also require loading RegistrarPoc objects given the
registrar ID, hence the change in that class.
Pubapi actions should always use cache, regardless of the config
settings on caching.

In EppResource.java, the original `loadCached(Iterable<VKey>)`
method is renamed to `loadByCacheIfEnabled`. The original
`loadCached(VKey)` method is renamed to `loadByCache` and always
uses cache.

In EppResourceUtils.java, the original `loadByForeignKeyCached`
method is renamed to `loadByForeignKeyByCacheIfEnabled`. A new
`loadByForeignKeyByCache` method is added, which always uses cache.

In ForeignKeyUtils.java, the original `loadCached` method is
renamed to `loadByCacheIfEnabled`, and a new `loadCached` method
is added which always uses cache.

Also added a `getContactsFromReplica` method in Registrar,
for use by RDAP actions.
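
To illustrate the distinction drawn above between "always uses cache" and "uses cache only if enabled", here is a minimal sketch. The class `CachingLoader`, its constructor, and the in-memory map are hypothetical stand-ins, not the actual Nomulus code; only the two method names are borrowed from the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical loader contrasting an always-cached path with a config-dependent one. */
final class CachingLoader<K, V> {
  private final Function<K, V> databaseLoader; // assumed stand-in for a real database read
  private final boolean cachingEnabled; // assumed stand-in for the caching config setting
  private final Map<K, V> cache = new ConcurrentHashMap<>();

  CachingLoader(Function<K, V> databaseLoader, boolean cachingEnabled) {
    this.databaseLoader = databaseLoader;
    this.cachingEnabled = cachingEnabled;
  }

  /** Always goes through the cache, regardless of the config setting. */
  V loadByCache(K key) {
    return cache.computeIfAbsent(key, databaseLoader);
  }

  /** Goes through the cache only when caching is enabled; otherwise reads directly. */
  V loadByCacheIfEnabled(K key) {
    return cachingEnabled ? loadByCache(key) : databaseLoader.apply(key);
  }
}
```

Under this sketch, a pubapi-style caller would use `loadByCache` so that its behavior does not depend on the caching configuration.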
Scripts needed by cron jobs wrongly removed by PR 2661.

TESTED: in crash.
In RDAP, domain queries are the most common by a factor of like 40,000
so we should optimize these as much as possible. We already have an EPP
resource / foreign key cache which does improve performance somewhat but
looking at some sample logs, it only cuts the RDAP request times by like
40% (looking at requests for the same domain a few seconds apart).

History entries don't change often, so we should cache them to make
subsequent queries faster as well. In addition, we're only caching two
fields per repo ID (modification time, registrar ID) so we can cache
more entries than we can for the EPP resource cache (which stores large
objects).
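
As a rough sketch of the idea above (a small per-repo-ID cache holding only modification time and registrar ID), the following uses a bounded LRU map from the JDK. The class and field names are hypothetical, and the actual change may well use a caching library with expiry instead; this only shows why small entries make a larger entry count affordable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache of the two small history fields, keyed by repo ID. */
final class HistorySummaryCache {
  /** The two fields cached per repo ID. */
  static final class Summary {
    final long modificationTimeMillis;
    final String registrarId;

    Summary(long modificationTimeMillis, String registrarId) {
      this.modificationTimeMillis = modificationTimeMillis;
      this.registrarId = registrarId;
    }
  }

  private final Map<String, Summary> entries;

  HistorySummaryCache(final int maxEntries) {
    // accessOrder=true keeps the least recently accessed entry first, so it is evicted first.
    this.entries =
        new LinkedHashMap<String, Summary>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, Summary> eldest) {
            return size() > maxEntries;
          }
        };
  }

  /** Returns the cached summary, or null if the repo ID is not cached. */
  synchronized Summary get(String repoId) {
    return entries.get(repoId);
  }

  synchronized void put(String repoId, Summary summary) {
    entries.put(repoId, summary);
  }

  synchronized int size() {
    return entries.size();
  }
}
```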
…oogle#2772)

This ratio defaults to 1.0 (i.e. all metrics will be recorded), but we will set
it much lower in sandbox and production, probably something closer to 0.01. This
will reduce recorded metrics volume and thus StackDriver cost, while still
retaining enough data for overall performance monitoring.

This is handled stochastically, so as to not require any coordination between
Java threads or GKE pods/clusters, as alternative approaches would (i.e. using a
counter and recording every Nth, or throttling to a max metrics qps).
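
A minimal sketch of the stochastic approach described above, assuming a hypothetical `MetricSampler` class (not the actual Nomulus implementation). Each call decides independently from a local random draw, which is why no counter or cross-thread/cross-pod coordination is needed.

```java
import java.util.Random;

/** Hypothetical sampler: decides per event whether to record a metric. */
final class MetricSampler {
  private final double ratio; // fraction of events to record, in [0.0, 1.0]
  private final Random random;

  MetricSampler(double ratio, Random random) {
    if (!(ratio >= 0.0 && ratio <= 1.0)) {
      throw new IllegalArgumentException("ratio must be in [0.0, 1.0]");
    }
    this.ratio = ratio;
    this.random = random;
  }

  /** Returns true with probability equal to the ratio, using only local randomness. */
  boolean shouldRecord() {
    // nextDouble() is uniform in [0.0, 1.0), so ratio 1.0 always records and 0.0 never does.
    return random.nextDouble() < ratio;
  }
}
```

With a ratio of 0.01, roughly one event in a hundred is recorded on average, with no guarantee about any particular run of events.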
…2780)

This is necessary so that the total number of requests/responses adds up
correctly even though some fraction of them are only being recorded. It uses
stochastic rounding so that the totals add up correctly even when the reciprocal
of the ratio isn't an integer.

This is a follow-up to PR google#2772.
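
The stochastic rounding mentioned above can be sketched as follows. This assumes each recorded event is weighted by the reciprocal of the sampling ratio; the class and method names are hypothetical. For example, with a ratio of 0.3 the reciprocal is about 3.33, so a recorded event counts as 3 about two thirds of the time and as 4 about one third of the time, which averages out to 3.33.

```java
import java.util.Random;

/** Hypothetical helper for weighting sampled events so that totals add up in expectation. */
final class StochasticRounding {
  private StochasticRounding() {}

  /** Rounds a non-negative value down or up so that the expected result equals the value. */
  static long round(double value, Random random) {
    long floor = (long) Math.floor(value);
    double fraction = value - floor;
    // Round up with probability equal to the fractional part.
    return random.nextDouble() < fraction ? floor + 1 : floor;
  }

  /** Number of events that one recorded event stands for, given the sampling ratio. */
  static long sampledWeight(double ratio, Random random) {
    return round(1.0 / ratio, random);
  }
}
```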
From the response profile:
2.4.6. Registrar URL - The entity with the registrar role in the RDAP response
MUST contain a links member [RFC9083]. The links object MUST contain
the elements: value, identical to the RDAP Base URL for the
Registrar as provided in the IANA “Registrar IDs” registry (i.e.,
https://www.iana.org/assignments/registrar-ids); rel:about, and href
containing the Registrar URL. Note: in cases where the Registry Operator
acts as sponsoring Registrar (e.g., IANA Registrar ID 9999), the href shall
contain a URL from the Registry.
This corresponds to the Feb 2024 response profile section 1.2 and
implementation guide 1.3 respectively, now that we comply (or are at
least closer to complying) with the Feb 2024 versions.

This should probably depend on google#2771
because that includes a small change included in the Feb 2024 version

This also updates the documentation to reference the proper areas of the
specifications.
…oogle#2781)

This prohibits all contact data on create and update EPP flows for both domain
and contact flows. It also refactors how default values on FeatureFlags work, as
it's safer to specify a single default on the flag itself rather than have to
specify it independently at a number of callsites (and potentially end up having
an inconsistent value). Domain updates on existing domains that still have
contact data will fail unless all contact data is removed, as a forcing function
to require registrars to rectify the situation prior to being able to do any
other kind of domain changes.

Contact-related flows that are still allowed after this point: Updating a domain
to remove all contacts from it, and deleting a contact object.
This works fairly similarly to the registry lock request and
verification mechanism. The request action generates a UUID which is
emailed (in link form) to the user in question. The frontend will send a
request to the verify action with the UUID and hopefully the action
should be finalized.

EPP password requests can be sent by anyone with edit-registrar
permissions and must be approved by an admin POC email.

Registry lock password resets can only be sent by primary contacts, and
are verified/performed by the user in question.
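
The request/verify flow above can be sketched roughly as below. The class `PasswordResetRequests` and its in-memory map are hypothetical; a real implementation would persist the request, enforce the permission rules described above, and most likely expire stale tokens, all of which are omitted here.

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory store for pending password reset requests. */
final class PasswordResetRequests {
  private final Map<String, String> pendingByToken = new ConcurrentHashMap<>();

  /** Request step: records the pending reset and returns the UUID to embed in the emailed link. */
  String request(String userEmail) {
    String token = UUID.randomUUID().toString();
    pendingByToken.put(token, userEmail);
    return token;
  }

  /** Verify step: consumes the token, so a given link can finalize a reset at most once. */
  Optional<String> verify(String token) {
    return Optional.ofNullable(pendingByToken.remove(token));
  }
}
```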
Some of these have been around since the Datastore days and are no
longer relevant (dealing with things like Datastore foreign keys). Let's
simplify things.
This is only enabled for admins, for now at least. It sends an email to
the registry lock email address to reset it.
…enient (google#2787)

It will now only throw errors on domain updates if a new contact/registrant has
been specified where none was previously present. This means that domain updates
on unrelated fields (e.g. nameserver changes) will succeed even if there is
existing contact data that the update is not removing.

This is a follow-up to google#2781.

BUG=http://b/434958659
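
One possible reading of the more lenient check, as a sketch only: reject an update only when it introduces contact data the domain did not already have, and let everything else through. The class name, method name, and use of plain string IDs are assumptions for illustration, not the actual flow code.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical check that rejects only newly introduced contact data on a domain update. */
final class ContactUpdateCheck {
  private ContactUpdateCheck() {}

  /**
   * Throws if the update would leave the domain with a contact it did not have before.
   * Updates that keep or remove existing contacts pass, even if contact data remains.
   */
  static void verifyNoNewContacts(Set<String> existingContacts, Set<String> contactsAfterUpdate) {
    Set<String> added = new HashSet<>(contactsAfterUpdate);
    added.removeAll(existingContacts);
    if (!added.isEmpty()) {
      throw new IllegalArgumentException("New contact data is not allowed: " + added);
    }
  }
}
```

Under this reading, a nameserver-only update on a domain with legacy contacts passes because the contact set is unchanged.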
UDtorrey and others added 8 commits January 9, 2026 14:41
* add cloud profiler to dockerfile and start script

* add apt-get update

* change in cb machine type for nomulus

* fix typo

* add max worker limit to gradle tests

* Switch to root before doing apt-get

* correct dockerfile

* jetty/Dockerfile

* profiler service conditional to kubernetes container name
This means that attempting to add a status that is already present will now
fail, and attempting to remove a status that is not present will also now fail.

This also refactors the existing checks into a single verify method, rather than
having to call three separate methods from every callsite.

BUG= http://b/474645068
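
A minimal sketch of what a single combined verify method for status changes could look like, under the rules stated above (adding a status that is already present fails, removing one that is absent fails). The class, method signature, and exception type are hypothetical and chosen only for illustration.

```java
import java.util.Set;

/** Hypothetical single entry point for validating status additions and removals. */
final class StatusUpdateCheck {
  private StatusUpdateCheck() {}

  /** Throws if any status to add is already present or any status to remove is absent. */
  static void verify(Set<String> existing, Set<String> toAdd, Set<String> toRemove) {
    for (String status : toAdd) {
      if (existing.contains(status)) {
        throw new IllegalArgumentException("Status already present: " + status);
      }
    }
    for (String status : toRemove) {
      if (!existing.contains(status)) {
        throw new IllegalArgumentException("Status not present: " + status);
      }
    }
  }
}
```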
This primarily affects the EPP greeting. We already were erroring out when any
contact flows attempted to be run; this should just prevent registrars from even
trying them at all.

This PR is designed to be minimally invasive, and does not remove any of the
contact flows or Jakarta XML/XJC objects/files themselves. That can be done
later as a follow-up.

Also note that the contact namespace urn:ietf:params:xml:ns:contact-1.0 is still
present for now in RDE exports, but I'll remove that subsequently as well.

BUG= http://b/475506288
Copilot AI review requested due to automatic review settings January 13, 2026 18:19
@UDtorrey UDtorrey requested a review from a team as a code owner January 13, 2026 18:19

Copilot AI left a comment

Pull request overview

This pull request introduces comprehensive MoSAPI (Monitoring of Service API) client functionality, including modules for configuration, HTTP communication, and service state monitoring. It also includes several significant refactorings to improve code quality and remove deprecated functionality.

Changes:

  • Added complete MoSAPI client implementation with support for service state monitoring, metrics, and error handling
  • Refactored EppResource and ForeignKeyUtils to improve caching and resource loading patterns
  • Updated contact and domain flows to support minimum dataset contacts changes
  • Added standard fee extension version 1.0 (RFC 8748) support with environment-specific availability

Reviewed changes

Copilot reviewed 298 out of 1275 changed files in this pull request and generated no comments.

File | Description
core/src/main/java/google/registry/mosapi/* | New MoSAPI client infrastructure including HTTP client, service monitoring, state management, and metrics
core/src/main/java/google/registry/model/ForeignKeyUtils.java | Enhanced foreign key loading with separate caches for repo IDs and full resources
core/src/main/java/google/registry/flows/domain/* | Updated domain flows to support minimum dataset contacts and improved resource loading
core/src/main/java/google/registry/model/domain/feestdv1/* | New fee extension version 1.0 implementation classes
core/src/main/java/google/registry/model/registrar/RegistrarPoc.java | Renamed WHOIS-related fields to RDAP and removed registry lock functionality

@UDtorrey UDtorrey changed the title from "Sync with upstream google/nomulus" to "Sync with upstream google/nomulus 01-13-2026" Jan 13, 2026
njshah301 and others added 18 commits January 13, 2026 18:48
…oring (google#2926)

* Configure cloud scheduler to trigger MoSAPI SLA status to cloud monitoring in production

- This job is set to trigger every 3 minutes so that we get near-real-time updates for our task.
- This will not trigger metrics for now, as we have not written the metrics-triggering logic yet
- Logs are added

* Change Trigger scheduling from 3 minutes to 5 minutes
The primary annoyance with this is that it means we need (or at least,
should) split all tests that use the fee extension into two separate
tests -- one that simulates non-prod environments, and one that
simulates prod environments. This leads to duplication of many tests but
that's fine since this is theoretically temporary.
This is a follow-on to PR google#2909, which fixed the issue for domains, but
apparently not fully for hostnames.

BUG= http://b/476144993
This is necessary to pass RST, as we cannot have any mention of contacts in our
escrow files as we are a thin registry.

BUG= http://b/474636582
The actual error is fixed as a side effect of PR google#2935, but this adds tests
verifying the intended behavior.

BUG= http://b/476144993
* add jvm metrics

* include all changes

* Fix tests and lint errors

* Fix formatting

* Instantiate jvmmetrics class in stackdriver module

* add metrics registration behaviour and explicit call

* redo tests

* fix formatting/variable name

* lint
Now that we've passed the RST testing (or at least the EPP portion of
it) we are no longer bound by the restriction to only use the fee
extension version 1.0 on sandbox.

For now, in order to avoid changing prod behavior, this does not enable
advertisement of the fee extension version 1.0 in production. We can
change this at any point in the future.
Add network and subnetwork parameters to flex template build so Dataflow
workers can reach Cloud SQL via Private Service Access.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…eam-20260113

# Conflicts:
#	core/src/main/java/google/registry/flows/host/HostFlowUtils.java
9 participants