Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pull] master from beetbox:master #1

Open
wants to merge 1,902 commits into
base: master
Choose a base branch
from
Open

Conversation

pull[bot]
Copy link

@pull pull bot commented Apr 6, 2022

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull bot added the ⤵️ pull label Apr 6, 2022
@snejus snejus force-pushed the master branch 3 times, most recently from c874689 to aa0db04 Compare November 22, 2024 01:36
snejus and others added 26 commits November 22, 2024 01:50
This utilises regex substitution in the substitute plugin. The previous
approach only used regex to match the pattern, then replaced it with a
static string. This change allows more complex substitutions, where the
output depends on the input.

### Example use case
Say we want to keep only the first artist of a multi-artist credit, as
in the following list:
```
Neil Young & Crazy Horse -> Neil Young
Michael Hurley, The Holy Modal Rounders, Jeffrey Frederick & The Clamtones -> Michael Hurley
James Yorkston and the Athletes -> James Yorkston
````
This would previously have required three separate rules, one for each
resulting artist. By using a regex substitution, we can get the desired
behaviour in a single rule:
```yaml
substitute:
  ^(.*?)(,| &| and).*: \1
```
(Capture the text until the first `,` ` &` or ` and`, then use that
capture group as the output)

### Notes
I've kept the previous behaviour of only applying the first matching
rule, but I'm not 100% sure it's the ideal approach.
I can imagine both cases where you want to apply several rules in
sequence and cases where you want to stop after the first match.
Quick fix for #5467.

Checks if the path for python is under the windows store folder then
error and point the user to the beets
[documentation](https://beets.readthedocs.io/en/stable/guides/main.html).

Happy for feedback to improve, but thought it best to exit as early as
possible.
Co-authored-by: Šarūnas Nejus <[email protected]>
Fixes #5148. 

When importing, the code that matches tracks does not consider the
medium number. This causes problems on Hybrid SACDs (and other releases)
where the artists, track numbers, titles, and lengths are the same on
both layers.

I added a distance penalty for mismatching medium numbers.

Before:

```
$ beet imp .

/Volumes/Music/ti/Red Garland/1958 - All Mornin' Long - 1 (6 items)

  Match (95.4%):
  The Red Garland Quintet - All Mornin' Long
  ≠ media, year
  MusicBrainz, 2xHybrid SACD (CD layer), 2013, US, Analogue Productions, CPRJ 7130 SA, mono
  https://musicbrainz.org/release/6a584522-58ea-470b-81fb-e60e5cd7b21e
  * Artist: The Red Garland Quintet
  * Album: All Mornin' Long
  * Hybrid SACD (CD layer) 1
     ≠ (#2-1) All Mornin' Long (20:21) -> (#1-1) All Mornin' Long (20:21)
     ≠ (#2-2) They Can't Take That Away From Me (10:24) -> (#1-2) They Can't Take That Away From Me (10:27)
     ≠ (#2-3) Our Delight (6:23) -> (#1-3) Our Delight (6:23)
  * Hybrid SACD (CD layer) 2
     ≠ (#1-1) All mornin' long (20:21) -> (#2-1) All Mornin' Long (20:21)
     ≠ (#1-2) They can't take that away from me (10:27) -> (#2-2) They Can't Take That Away From Me (10:25)
     ≠ (#1-3) Our delight (6:23) -> (#2-3) Our Delight (6:23)
➜ [A]pply, More candidates, Skip, Use as-is, as Tracks, Group albums,
Enter search, enter Id, aBort, eDit, edit Candidates?
```

Note that all tracks tagged with disc 1 get moved to disc 2 and vice
versa.

After:

```
$ beet-test imp .

/Volumes/Music/ti/Red Garland/1958 - All Mornin' Long - 1 (6 items)

  Match (95.4%):
  The Red Garland Quintet - All Mornin' Long
  ≠ media, year
  MusicBrainz, 2xMedia, 2013, US, Analogue Productions, CPRJ 7130 SA, mono
  https://musicbrainz.org/release/6a584522-58ea-470b-81fb-e60e5cd7b21e
  * Artist: The Red Garland Quintet
  * Album: All Mornin' Long
  * Hybrid SACD (CD layer) 1
     ≠ (#1-1) All mornin' long (20:21) -> (#1-1) All Mornin' Long (20:21)
     ≠ (#1-2) They can't take that away from me (10:27) -> (#1-2) They Can't Take That Away From Me (10:27)
     ≠ (#1-3) Our delight (6:23) -> (#1-3) Our Delight (6:23)
  * Hybrid SACD (SACD layer) 2
     * (#2-1) All Mornin' Long (20:21)
     * (#2-2) They Can't Take That Away From Me (10:24)
     * (#2-3) Our Delight (6:23)
➜ [A]pply, More candidates, Skip, Use as-is, as Tracks, Group albums,
Enter search, enter Id, aBort, eDit, edit Candidates?
```

Yay!
Except under GitHub CI, where we expect all tests to run.
JOJ0 and others added 30 commits January 24, 2025 10:44
…y prior to users importing their music library
… getting started guide #4820

## Description

Added a quick checkpoint to ensure the config file is set up correctly
prior to users importing their music library. This was something I
discovered later after running into an issue with my config file and
hope it helps new users avoid the issues I had.
This commit introduces a distance threshold mechanism for the Genius and
Google backends.

- Create a new `SearchBackend` base class with a method `check_match`
  that performs checking.
- Start using undocumented `dist_thresh` configuration option for good,
  and mention it in the docs. This controls the maximum allowable
  distance for matching artist and title names.

These changes aim to improve the accuracy of lyrics matching, especially
when there are slight variations in artist or title names, see #4791.
Improve requests performance with requests.Session which uses connection
pooling for repeated requests to the same host.

Additionally, this centralizes request configuration, making sure that
we use the same timeout and provide beets user agent for all requests.
Having removed it I fuond that only the Genius lyrics changed: it had en
extra new line. Thus I defined a function 'collapse_newlines' which now
gets called for the Genius lyrics.
Tidy up 'Google.is_page_candidate' method and remove 'Google.sluggify'
method which was a duplicate of 'slug'.

Since 'GeniusFetchTest' only tested whether the artist name is cleaned
up (the rest of the functionality is patched), remove it and move its
test cases to the 'test_slug' test.
* Type the response data that Google Custom Search API return.
* Exclude some 'letras.mus.br' pages that do not contain lyric.
* Exclude results from Musixmatch as we cannot access their pages.
* Improve parsing of the URL title:
  - Handle long URL titles that get truncated (end with ellipsis) for
    long searches
  - Remove domains starting with 'www'
  - Parse the title AND the artist. Previously this would only parse the
    title, and fetch lyrics even when the artist did not match.
* Remove now redundant credits cleanup and checks for valid lyrics.
Additionally, improve HTML pre-processing:

* Ensure a new line between blocks of lyrics text from letras.mus.br.
* Parse a missing last block of lyrics text from lacocinelle.net.
* Parse a missing last block of lyrics text from paroles.net.
* Fix encoding issues with AZLyrics by setting response encoding to
  None, allowing `requests` to handle it.
If we get caught by Cloudfare, it forwards our request somewhere else
and returns some validation text response. To make sure that this text
does not get assumed for lyrics, we can disable redirects for the Google
backend, check the response code and raise if there's a redirect
attempt. This source will then be skipped and the backend continues with
the next one.
I think we can make our life easier by removing these checks assuming
that users follow the instructions in the docs.
It was my mistake to remove search earlier - I found that in many cases
it works fine.
…tionality (#5474)

### Bug Fixes
- Fixed #4791: Resolved an issue with the Genius backend where it
couldn't match lyrics if there was a slight variation in the artist's
name.

### Plugin Enhancements
* **Session Management**: Introduced a `TimeoutSession` to enable
connection pooling and maintain consistent configuration across
requests.
* **Error Handling**: Centralized error handling logic in a new
`RequestsHandler` class, which includes methods for retrieving either
HTML text or JSON data.
* **Logging**: Added methods to ensure the backend name is included in
log messages.

### Configuration Changes
* Added a new `dist_thresh` field to the configuration, allowing users
to control the maximum tolerable mismatch between the artist and title
of the lyrics search result and their item. Interestingly, this field
was previously available (though undocumented) and used in the
`Tekstowo` backend. Now, this threshold has also been applied to
**Genius** and **Google** search logic.

### Backend Updates
* All backends that perform searches now validate each result against
the configured `dist_thresh`.

#### Genius
* Removed the need to scrape HTML tags for lyrics; instead, lyrics are
now parsed from the JSON data embedded in the HTML. This change should
reduce our vulnerability to Genius' frequent alterations in their HTML
structure.
* Documented the structure of their search JSON data.

#### Google
* Typed the response data returned by the Google Custom Search API.
* Excluded certain pages under **https://letras.mus.br** that do not
contain lyrics.
* Excluded all results from MusiXmatch, as we cannot access their pages.
* Improved parsing of URL titles (used for matching item/lyrics
artist/title):
- Handled results from long search queries where URL titles are
truncated with an ellipsis.
  - Enhanced URL title cleanup logic.
- Added functionality to determine (or rather, guess) not only the track
title but also the artist from the URL title.
* Similar to #5406, search results are now compared to the original item
and sorted by distance. Results exceeding the configured `dist_thresh`
value are discarded. The previous functionality simply selected the
first result containing the track's title in its URL, which often led to
returning lyrics for the wrong artist, particularly for short track
titles.
* Since we now fetch lyrics confidently, redundant checks for valid
lyrics and credits cleanup have been removed.

### HTML Cleanup
* Organized regex patterns into a new `Html` class.
* Adjusted patterns to ensure new lines between blocks of lyrics text
scraped from `letras.mus.br` and `musica.com`.
* Modified patterns to scrape missing lyrics text on `paroles.net` and
`lacoccinelle.net`. See the diff in `test/plugins/lyrics_page.py`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.