Certain Sites Only Work When facebookexternalhit/1.1 User-Agent Is Specified #28

Closed
kachi1227 opened this issue Sep 21, 2020 · 3 comments

Comments

@kachi1227

First of all, great library! Thanks for putting it together.

Earlier this year, Twitter stopped returning Open Graph data unless the request carries one of a few specific user agents. The only one I've gotten to work is the 'facebookexternalhit/1.1' user agent, and it should, in theory, keep working since it's the user agent the Facebook crawler uses. Unfortunately, I only figured this out after a considerable amount of debugging. To save other users time, it might be helpful to update the code in the examples folder to the following:
$client = new Psr18Client(new NativeHttpClient(['headers' => ['User-Agent' => 'facebookexternalhit/1.1']]));

In case you're wondering, I've tested the code above with all major websites (Google, Facebook, CNN, Twitter, YouTube, Instagram, Amazon, etc.) and they all return valid Open Graph data.

Attached are screenshots of the Open Graph data that gets returned when the following URL is passed: https://twitter.com/CNN/status/1308170698175774720. The first screenshot is the result without a user agent and the second is the result with the user agent mentioned above.

No User-Agent:
[Screenshot: Screen Shot 2020-09-21 at 6.50.03 PM]

facebookexternalhit/1.1 User-Agent:
[Screenshot: Screen Shot 2020-09-21 at 6.50.45 PM]

@mburtscher
Member

Well, this is interesting. I think I'll add a user-agent option which defaults to facebookexternalhit/1.1.

@mburtscher
Member

Since client configuration happens outside the library, I don't think it would be a good idea to modify the client after it has been passed to the Consumer. So I've added a note to the README and updated the example code.

Thank you for analyzing the issue and explaining so well!

@johnchristopher

johnchristopher commented Dec 11, 2023

> $client = new Psr18Client(new NativeHttpClient(['headers' => ['User-Agent' => 'facebookexternalhit/1.1']]));
>
> In case you're wondering, I've tested the code above with all major websites (Google, Facebook, CNN, Twitter, YouTube, Instagram, Amazon, etc.) and they all return valid Open Graph data.

This user agent is blocked with a 403 on some European newspaper websites, e.g. https://lesoir.be/.

$ curlh https://www.lesoir.be/ -H "User-Agent: facebookexternalhit/1.1"
HTTP/2 403 
server: AkamaiGHost
mime-version: 1.0
content-type: text/html
content-length: 262
date: Mon, 11 Dec 2023 09:43:22 GMT

$ curlh https://www.lesoir.be/ -H "User-Agent: foobar/2000"
HTTP/2 200 
x-content-type-options: nosniff
[..]

The default user agents of the Symfony and Guzzle clients are not blocked, so with them the meta and og tags can be read. I ended up adding a switch that picks the user agent depending on the host.
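
A per-host switch like this could be sketched in plain PHP 8 roughly as follows. The helper returns client options carrying the facebookexternalhit/1.1 header by default, and an empty array (so the client's own default user agent applies) for hosts known to reject it. The function name, the constant names, and the host list are assumptions made for illustration; only lesoir.be comes from this thread.

```php
<?php

// User agent suggested earlier in this thread.
const FACEBOOK_UA = 'facebookexternalhit/1.1';

// Hypothetical list of hosts that answer 403 to the Facebook user agent.
// Only lesoir.be was reported above; extend it from your own testing.
const HOSTS_REJECTING_FACEBOOK_UA = ['lesoir.be'];

/**
 * Returns HTTP client options for the given URL.
 *
 * An empty array means "send no custom User-Agent", which leaves the
 * client's default user agent in place.
 */
function clientOptionsForUrl(string $url): array
{
    // parse_url() yields null or false when no host can be extracted;
    // the cast turns either into an empty string.
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));

    foreach (HOSTS_REJECTING_FACEBOOK_UA as $blocked) {
        // Match the bare domain and any subdomain such as www.lesoir.be.
        if ($host === $blocked || str_ends_with($host, '.' . $blocked)) {
            return [];
        }
    }

    return ['headers' => ['User-Agent' => FACEBOOK_UA]];
}
```

The returned array is meant to be passed where the options array appears earlier in this thread, for example new NativeHttpClient(clientOptionsForUrl($url)). That implies building a client per URL, which is a trade-off of this sketch rather than something the library prescribes.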
