This repository has been archived by the owner on Mar 30, 2023. It is now read-only.

fix(profile): ported user profile to v2 API endpoint #955

Merged
merged 30 commits into from
Oct 29, 2020

Conversation

himanshudabas
Contributor

Fixed the user profile feature, which had been broken since Twitter deprecated the v1 endpoints.

  • A very quick way to get a user's timeline (~3200 tweets, including retweets and replies).
  • Unlike twint -u xyz, results from this method are sorted.
  • Fixed retweet_id, user_rt_id, user_rt and retweet_date, which were previously commented out.
  • Code cleanup.
  • Refactoring.
  • Fixes "No data is being retrieved if geo config is set !!!" #947

@himanshudabas
Contributor Author

himanshudabas commented Oct 12, 2020

One more thing I'd like to add.

The Travis build will fail for the library even when everything works fine. That's because Travis is probably hosted on AWS/GCP, and Twitter doesn't provide guest tokens to AWS/GCP IP addresses. I have tested this on AWS with multiple IPs and none of them worked.
But if we can get the guest token through a proxy, Twitter doesn't check the IP afterwards, and we can keep fetching data without the proxy.
So we only need the proxy for token refreshing.

I have no idea how proxies are implemented in twint. If someone could take this up, it would fix the Travis issue.

@essentialols

essentialols commented Oct 13, 2020

Thanks for the commit! Will this scrape followers and followings?

@himanshudabas
Contributor Author

(Sounds like only needed for initial token get, so no need to run full CI on tor)

@lmeyerov
Check this out. I implemented a new solution for the AWS problem mentioned above.

What it does is:

  1. Try to get the initial token using requests without a proxy.
  2. If step 1 fails, create a Tor session and try to get the initial token through that.

The catch is that Tor is really slow, but I am only using the Tor session to request the initial token, which gets refreshed after 200 requests. So it doesn't degrade Twint's performance by more than a few seconds.
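
The two steps and the 200-request refresh described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: GuestTokenCache, direct_fetch and tor_fetch are hypothetical names, and the two fetchers are passed in as plain callables so the sketch makes no claims about the requests or torpy APIs.

```python
from typing import Callable, Optional


class GuestTokenCache:
    """Caches a guest token and refreshes it after a fixed number of uses.

    Hypothetical helper for illustration; not part of twint.
    """

    def __init__(
        self,
        direct_fetch: Callable[[], str],
        tor_fetch: Callable[[], str],
        refresh_after: int = 200,
    ) -> None:
        self._direct_fetch = direct_fetch  # fast path, no proxy
        self._tor_fetch = tor_fetch        # slow path, only used as a fallback
        self._refresh_after = refresh_after
        self._token: Optional[str] = None
        self._uses = 0

    def _fetch(self) -> str:
        try:
            # Step 1: try to get the token without any proxy.
            return self._direct_fetch()
        except Exception:
            # Step 2: the direct attempt failed (e.g. a blocked cloud IP),
            # so fall back to the Tor-backed fetcher.
            return self._tor_fetch()

    def get(self) -> str:
        """Return the cached token, fetching a new one when it is due."""
        if self._token is None or self._uses >= self._refresh_after:
            self._token = self._fetch()
            self._uses = 0
        self._uses += 1
        return self._token
```

Because the slow fetcher only runs when a refresh is due and the direct attempt fails, ordinary data requests never go through Tor in this sketch.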

@himanshudabas
Contributor Author

-- use it: https://github.com/TheDataRideAlongs/ProjectDomino/blob/3e258e151495be9012e5d75650b53679384e977a/modules/TwintPool.py#L25

I see that you are running Tor on Docker? (Forgive me if I'm wrong, I don't know much about Docker.)
I initially tried running Tor on my machine and modified the initial guest token fetch for that, but later realized that it would be better to bundle this functionality into twint, because users might not want to (or may not know how to) set up Tor on their machines.
So I found this library, torpy, which does exactly this. Since we don't need to make many token fetch requests, it won't affect the speed of scraping at all.

@himanshudabas
Contributor Author

@essentialols Followers and following were never broken, if I am correct.
They even work on the current version. Have you tried fetching the followers and following?

@ixsure

ixsure commented Oct 15, 2020

This is my first comment on GitHub, thank you so much. It's useful for my issue. Tor isn't always stable and requests may time out, but still, thank you so much. God bless you.

@himanshudabas
Contributor Author

This is my first comment on GitHub, thank you so much. It's useful for my issue. Tor isn't always stable and requests may time out, but still, thank you so much. God bless you.

Yes, you are right, Tor is often unstable. To handle that, I catch the TimeoutException and create a new session if a timeout occurs.
Even better, Tor is only used to fetch the token, so this should hardly affect the performance of twint.
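
As a rough sketch of that retry idea (not the actual code in this PR): fetch_with_retry, make_session and fetch are hypothetical names, and Python's built-in TimeoutError stands in for whatever timeout exception the Tor client really raises.

```python
from typing import Callable, Optional, TypeVar

S = TypeVar("S")


def fetch_with_retry(
    make_session: Callable[[], S],
    fetch: Callable[[S], str],
    attempts: int = 3,
) -> str:
    """Fetch a token, creating a brand-new session after every timeout."""
    last_error: Optional[BaseException] = None
    for _ in range(attempts):
        # A fresh session per attempt, so a slow circuit is not reused.
        session = make_session()
        try:
            return fetch(session)
        except TimeoutError as exc:  # stand-in for the library's timeout error
            last_error = exc
    raise RuntimeError(f"token fetch timed out {attempts} times") from last_error
```

Capping the number of attempts keeps a persistently unreachable network from blocking the scraper forever.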

I am glad this helped you. 👍

@himanshudabas
Contributor Author

@pielco11 if you think there are some problems with this PR, please let me know, so that I can make necessary adjustments.

tombstone tweets are those tweets which are flagged by Twitter for being inappropriate, misleading, graphic etc.
This patch fixes the issue caused by twintproject#967, which broke the functionality of saving the retrieved data into a csv file.
@lmeyerov
Contributor

FWIW, looking forward to trying this! cc @webcoderz

fixes twintproject#970, lookup is ported to v2 endpoint. this can now be used to lookup a certain profile.
@pielco11 pielco11 merged commit 52ee752 into twintproject:master Oct 29, 2020
@Greatdane

On Python 3.8 I keep getting this error relating to dataclasses:

Traceback (most recent call last):
  File "/layers/google.python.webserver/gunicorn/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/layers/google.python.webserver/gunicorn/gunicorn/workers/gthread.py", line 92, in init_process
    super().init_process()
  File "/layers/google.python.webserver/gunicorn/gunicorn/workers/base.py", line 119, in init_process
    self.load_wsgi()
  File "/layers/google.python.webserver/gunicorn/gunicorn/workers/base.py", line 144, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/layers/google.python.webserver/gunicorn/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/layers/google.python.webserver/gunicorn/gunicorn/app/wsgiapp.py", line 49, in load
    return self.load_wsgiapp()
  File "/layers/google.python.webserver/gunicorn/gunicorn/app/wsgiapp.py", line 39, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/layers/google.python.webserver/gunicorn/gunicorn/util.py", line 358, in import_app
    mod = importlib.import_module(module)
  File "/opt/python3.8/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/srv/main.py", line 1, in <module>
    from ElonTweets.wsgi import application
  File "/srv/ElonTweets/wsgi.py", line 16, in <module>
    application = get_wsgi_application()
  File "/layers/google.python.pip/pip/django/core/wsgi.py", line 12, in get_wsgi_application
    django.setup(set_prefix=False)
  File "/layers/google.python.pip/pip/django/__init__.py", line 24, in setup
    apps.populate(settings.INSTALLED_APPS)
  File "/layers/google.python.pip/pip/django/apps/registry.py", line 122, in populate
    app_config.ready()
  File "/srv/website/apps.py", line 8, in ready
    from website import updater
  File "/srv/website/updater.py", line 5, in <module>
    import twint
  File "/layers/google.python.pip/pip/twint/__init__.py", line 12, in <module>
    from .config import Config
  File "/layers/google.python.pip/pip/twint/config.py", line 5, in <module>
    class Config:
  File "/layers/google.python.pip/pip/dataclasses.py", line 958, in dataclass
    return wrap(_cls)
  File "/layers/google.python.pip/pip/dataclasses.py", line 950, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
  File "/layers/google.python.pip/pip/dataclasses.py", line 800, in _process_class
    cls_fields = [_get_field(cls, name, type)
  File "/layers/google.python.pip/pip/dataclasses.py", line 800, in <listcomp>
    cls_fields = [_get_field(cls, name, type)
  File "/layers/google.python.pip/pip/dataclasses.py", line 659, in _get_field
    if (_is_classvar(a_type, typing)
  File "/layers/google.python.pip/pip/dataclasses.py", line 550, in _is_classvar
    return type(a_type) is typing._ClassVar
AttributeError: module 'typing' has no attribute '_ClassVar'

I assume this is in relation to #1000 and the fact the requirements now specify dataclasses, which are not needed in 3.8. Any help would be appreciated.
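
For context on the traceback above: dataclasses has shipped with the standard library since Python 3.7, and the PyPI package of the same name is a backport meant for 3.6. The traceback shows the copy under /layers/google.python.pip/pip/ being imported instead of the standard module, and that backport still refers to typing._ClassVar, which no longer exists. A minimal sketch for spotting this from a module path follows; shadows_stdlib is a hypothetical helper, not part of twint, and it assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def shadows_stdlib(module_file: str, stdlib_dir: str) -> bool:
    """Return True if module_file is a third-party copy rather than the stdlib one.

    Assumes POSIX-style paths such as those in the traceback.
    """
    path = PurePosixPath(module_file)
    inside_stdlib = PurePosixPath(stdlib_dir) in path.parents
    # site-packages usually lives *inside* the stdlib directory, so check it too.
    third_party = "site-packages" in path.parts or "dist-packages" in path.parts
    return third_party or not inside_stdlib


if __name__ == "__main__":
    import dataclasses
    import sysconfig

    # Report where the running interpreter actually loaded dataclasses from.
    print(shadows_stdlib(dataclasses.__file__, sysconfig.get_paths()["stdlib"]))
```

A common remedy is to restrict the requirement with a PEP 508 environment marker, for example dataclasses; python_version < "3.7", although the thread does not say what the eventual patch does.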

@himanshudabas
Contributor Author

On Python 3.8 I keep getting this error relating to dataclasses:

I assume this is in relation to #1000 and the fact the requirements now specify dataclasses, which are not needed in 3.8. Any help would be appreciated.

@Greatdane
Hi, I have put up a patch for it, so your issue should be resolved now. Let me know if it works for you by installing directly from my branch.
I'll create a PR for it if it resolves your issue.

@Greatdane

Greatdane commented Nov 12, 2020 via email

darvell pushed a commit to darvell/twint that referenced this pull request Nov 16, 2020
* fix for deprecation of v1.1 endpoints

* fix for cashtags

* typo

* fix(datetime): _formatDateTime tries %d-%m-%y

* fix(pandas): use new str-format Tweet.datetime data rep

* fix(pandas datetime): use ms

* fix(cashtags unwind): undo PRs field removals

* Revert "fix(cashtags unwind): undo PRs field removals"

This reverts commit dfa57c2.

* fix(pandas): remove broken fields

* fix(cash): use provided field as suggested by pr review

* fix (cashtags): re enable cashtags in output

* fix(db): remove broken fields

* fix(datetime): Y-m-d and factored out

* fixes twintproject#947

* fix(get.py): json exception in User

* to-do: added to-do tasks

added to-do tasks for --profile-full feature

* chore(test): PEP8 formatting

* fix(profile): ported user profile to v2 API

fixed user profile feature which was broken since v1 endpoints were deprecated

* updated Readme

* fix: fixes twintproject#965 inconsistent timezones

* fix: handle tombstone tweets

tombstone tweets are those tweets which are flagged by Twitter for being inappropriate, misleading, graphic etc.

* fixes twintproject#976: saving tweets to csv

This patch fixes the issue caused by twintproject#967, which broke the functionality of saving the retrieved data into a csv file.

* feature: port Lookup to v2 endpoint

fixes twintproject#970, lookup is ported to v2 endpoint. this can now be used to lookup a certain profile.

Co-authored-by: SiegfriedWagner
Co-authored-by: lmeyerov
@hadisfr

hadisfr commented Dec 2, 2020

Hi!
It seems that the --profile-full option was removed from the app in this PR.

because it is no longer required, default method will do this
https://github.com/twintproject/twint/blame/master/twint/cli.py#L207

It seems that in the current version of the code, twint won't handle shadow-banned or formerly private accounts. Using --profile-full still fixes the issue.
Am I wrong or missing something? 🤔 If I am right, could you please bring that option back?

@himanshudabas
Contributor Author

himanshudabas commented Dec 3, 2020

It seems that in the current version of the code, twint won't handle shadow-banned or formerly private accounts. Using --profile-full still fixes the issue.

Hi,
Thanks for bringing this up. Could you please share some examples where the default or the --timeline flag fails,
so that I can do a little bit of testing?

One more thing I'd like to add: the older --profile-full feature is going to stop working anyway after 15 December 2020.
Why?
Because it uses the mobile (no-JavaScript) version of Twitter, and Twitter is going to kill the mobile version on the 15th.
So even if we bring that feature back, it will stop working soon.

PS: features like followers and following are going to stop working too, because they also use the mobile version for scraping.

@hadisfr

hadisfr commented Dec 3, 2020

@himanshudabas
I was not aware of the --timeline option. I tested my case with --timeline and got 3188 tweets, the same as with the --profile-full option. (I expected about 13k tweets, but I think it's impossible to get all of them due to limitations of Twitter itself.)
So it may be a good idea to just update the FAQ part of the README and use --timeline there. 🤔

Is there any way to rewrite the code to use the desktop web version or the internal GraphQL API instead of the mobile version, to bring followers, followings, etc. back?
And what was the difference between the --profile-full and --timeline mechanisms?
I am too busy these months, but I may help by writing some code and creating PRs after that. 🤔

@himanshudabas
Contributor Author

I tested my case with --timeline and got 3188 tweets, the same as with the --profile-full option. (I expected about 13k tweets, but I think it's impossible to get all of them due to limitations of Twitter itself.)

That's because --timeline and --profile-full are essentially the same, except for how they work internally. While --profile-full uses the older mobile website and scrapes tweets with BeautifulSoup, --timeline uses the newer v2 API. The limitation for both is that they can fetch a maximum of 3200 tweets, retweets and replies combined, because even if you manually scroll to the bottom of a user's timeline on Twitter, you cannot load more than 3200 tweets.

So it may be a good idea to just update the FAQ part of the README and use --timeline there. 🤔

Yes, you are correct. The Twint Wiki has been overdue for an update for a long time now, but I am currently busy and unable to update it.

Is there any way to rewrite the code to use the desktop web version or the internal GraphQL API instead of the mobile version, to bring followers, followings, etc. back?

Not that I am aware of. I tried going through all the GraphQL endpoints and wasn't able to find any that can be used for this. The newer APIs require authentication for this.
Do let me know if you find a way around this.

And what was the difference between the --profile-full and --timeline mechanisms?

The technical difference is what I explained above. Non-technically they are the same, so why did I rename it?
Because on Twitter it's called the user's timeline, and it's better to keep things consistent with that.
I do apologise that I haven't added it to the Wiki yet; perhaps you can make the changes?
Note: only the command-line flag changed from --profile-full to --timeline; importing twint as a package in a Python script still works the same.
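
For the package route mentioned in the note, a minimal sketch might look like the following. It assumes twint.run.Profile is the package-level counterpart of the timeline fetch, which the thread implies but does not spell out; timeline_settings and fetch_timeline are hypothetical helper names, and twint is imported lazily so the settings helper can be used without it installed.

```python
from typing import Any, Dict, Optional


def timeline_settings(username: str, limit: Optional[int] = None) -> Dict[str, Any]:
    """Build the twint.Config attributes for a timeline fetch (hypothetical helper)."""
    settings: Dict[str, Any] = {"Username": username, "Hide_output": True}
    if limit is not None:
        settings["Limit"] = limit
    return settings


def fetch_timeline(username: str, limit: Optional[int] = None) -> None:
    """Run a profile/timeline scrape; needs twint installed and network access."""
    import twint  # imported lazily so timeline_settings works without twint

    config = twint.Config()
    for name, value in timeline_settings(username, limit).items():
        setattr(config, name, value)
    # Assumption: Profile is what the command-line timeline flag maps to.
    twint.run.Profile(config)
```

Whatever the entry point, the roughly 3200-item ceiling discussed above still applies.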

@hadisfr

hadisfr commented Dec 3, 2020

There was only one occurrence of --profile-full. As you said, there isn't any config.Timeline yet. And I didn't change --followers and --followings.
Ref #1052

@himanshudabas
Contributor Author

Looks good.

One other thing regarding your previous issue.

Although I expected about 13k tweets

If the user you are searching for is not shadow-banned and is reachable through Twitter Advanced Search, you can use something like this:

twint -s "from:realDonaldTrump"

This should fetch *almost all the tweets and replies from realDonaldTrump.

  • almost, because the total number of tweets shown on a user's profile counts all tweets, retweets, replies and deleted tweets. Of these, Twitter Advanced Search can only fetch tweets and replies. There is a way to fetch retweets using Advanced Search, but it has been broken for years.
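
The same search can presumably be driven from a Python script as well. The sketch below assumes the usual twint.Config / twint.run.Search usage; from_query and search_user_tweets are hypothetical helper names, and twint is imported lazily so the query builder works without it installed.

```python
def from_query(username: str, extra_terms: str = "") -> str:
    """Build an Advanced Search query limited to one author (hypothetical helper)."""
    query = f"from:{username.lstrip('@')}"
    return f"{query} {extra_terms}".strip()


def search_user_tweets(username: str) -> None:
    """Run the search through twint; needs twint installed and network access."""
    import twint  # imported lazily so from_query works without twint

    config = twint.Config()
    config.Search = from_query(username)
    twint.run.Search(config)
```

As noted above, this route returns tweets and replies only, so the count will still fall short of the total shown on the profile.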

@hadisfr

hadisfr commented Dec 3, 2020

The user was formerly private (I think), so I could not fetch all tweets from search and had to use the flag to scrape from the timeline.

Development

Successfully merging this pull request may close these issues.

No data is being retrieved if geo config is set !!!
9 participants