Twitter data¶
Our goal is to create a program (daemon) that runs at a fixed interval and queries Twitter for tweets matching a given search query. We then want to store these tweets in a database for later analysis.
Goals¶
Concretely, the program should be able to:
- Retrieve tweets from Twitter matching any search criteria
- Store retrieved tweets in a database
- Handle network failures gracefully
- Obey Twitter rate limits
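One way to picture the overall shape of such a program is a loop that runs a job, logs any failure, and then sleeps until the next run. The sketch below is only an illustration under that assumption; the names `run_periodically`, `job`, and `max_runs` are hypothetical and do not come from the project repository.

```python
import logging
import time

log = logging.getLogger("tweet_daemon")  # hypothetical logger name


def run_periodically(job, interval_seconds, sleep=time.sleep, max_runs=None):
    """Call job() repeatedly, pausing interval_seconds after each call.

    Exceptions raised by job() are logged and swallowed, so one bad run
    (for example a network failure) does not stop the loop.
    max_runs exists only so the loop can be stopped, e.g. in tests.
    Returns the number of runs that failed.
    """
    failures = 0
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception:
            failures += 1
            log.exception("Run %d failed; trying again next interval", runs + 1)
        runs += 1
        sleep(interval_seconds)
    return failures
```

Here `job` would be a function that performs one search and stores the results; passing `sleep` as a parameter keeps the loop easy to exercise without actually waiting.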
Introduction¶
Luckily for us, Twitter offers an official API through which we can interact with their databases. To begin, you will need to create an account with Twitter; we will use this account to request an API key for our application. You can get started by cloning the orie-5270-twitter.git repository.
Documentation¶
The Twitter API is well documented. The documentation home page is available at https://dev.twitter.com/overview/documentation. Interesting sections include authorizing your application and the actual REST API.
Requesting an API key¶
Log into your Twitter account, navigate to https://apps.twitter.com, and click the “Create New App” button. Once the app is created, you can click on your project to manage your API keys. It is very important that you keep your API keys secret. If you believe your API key has been compromised, go to this page and revoke the key immediately.
First steps¶
Your first goal should be to create a program that can request a bearer token, which can be used by your program to make subsequent API calls. The documentation on authorizing your application details the steps that need to be taken to achieve this; the following sections follow along with that documentation.
Throughout this section we will use the same keys as in the documentation; namely, we assume we have
API_KEY = 'xvz1evFS4wEEPTGEFPHBog'
API_SECRET_KEY = 'L8qq9PZyRg6ieKGEKhZolGC0vJWLw8iEJ88DRdyOg'
You will eventually want to proceed with the keys for your own application, but first check that your code correctly encodes these sample keys.
Encoding the consumer key and secret key¶
We join the API key and API secret key with a colon to create the bearer credentials.
bearer_credentials = '{}:{}'.format(API_KEY, API_SECRET_KEY)
We then base64 encode these credentials
import base64
encoded = base64.b64encode(bytes(bearer_credentials, 'utf-8'))
credentials = str(encoded, 'utf-8')
Note that we first encode the bearer_credentials string as UTF-8 bytes, base64 encode those bytes, and then decode the result back to a standard Python string.
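To check the encoding against the sample keys, the steps above can be wrapped in a small function. This is a minimal sketch: the function name `encode_credentials` is made up here, and the URL-encoding step is included because Twitter's documentation describes it, although it leaves purely alphanumeric keys such as these unchanged.

```python
import base64
from urllib.parse import quote


def encode_credentials(api_key, api_secret_key):
    """Return the base64-encoded 'key:secret' string used for Basic auth."""
    # URL-encode each part first; a no-op for purely alphanumeric keys.
    bearer_credentials = '{}:{}'.format(quote(api_key, safe=''),
                                        quote(api_secret_key, safe=''))
    encoded = base64.b64encode(bearer_credentials.encode('utf-8'))
    return encoded.decode('utf-8')


# Sample keys from this section (not real credentials).
API_KEY = 'xvz1evFS4wEEPTGEFPHBog'
API_SECRET_KEY = 'L8qq9PZyRg6ieKGEKhZolGC0vJWLw8iEJ88DRdyOg'

credentials = encode_credentials(API_KEY, API_SECRET_KEY)
print(credentials)
# eHZ6MWV2RlM0d0VFUFRHRUZQSEJvZzpMOHFxOVBaeVJnNmllS0dFS2hab2xHQzB2SldMdzhpRUo4OERSZHlPZw==
```

If your own code prints a different string for these sample keys, the encoding step is the first place to look.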
Obtaining a bearer token¶
Once you have constructed your credentials from above, you are ready
to request a bearer token from Twitter. Using the requests
library, this can be done as follows.
import requests

headers = {
    'Authorization': 'Basic {}'.format(credentials),
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'
}
data = 'grant_type=client_credentials'
auth_endpoint = 'https://api.twitter.com/oauth2/token'
r = requests.post(auth_endpoint, headers=headers, data=data)
The variable r is a response object that we should inspect for our result. First, it should return the 200 status code (success).
if r.status_code != 200:
    raise RuntimeError('Bad status code: {}'.format(r.status_code))
You should of course use a more reasonable exception type, since your application will need to handle errors gracefully. Assuming you received the correct response, you now need to verify you were given a bearer token. We convert the content of the response to its JSON representation.
resp = r.json()
if resp['token_type'] != 'bearer':
    raise RuntimeError('Bad token type: {}'.format(resp['token_type']))
Again, use a better exception type. Finally, we can get our bearer token.
bearer_token = resp['access_token']
This token should be saved in your program to be used in future API calls. Note that this token will need to be refreshed periodically, so the entire process of obtaining a bearer token should be reproducible with a single function call.
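One possible way to package the steps of this section into a single call is sketched below. The names `get_bearer_token` and `TwitterAuthError` are hypothetical, and the `post` parameter is an assumption added here so the function can be exercised without network access; by default it falls back to `requests.post`.

```python
import base64

AUTH_ENDPOINT = 'https://api.twitter.com/oauth2/token'


class TwitterAuthError(Exception):
    """Raised when a bearer token cannot be obtained (hypothetical name)."""


def get_bearer_token(api_key, api_secret_key, post=None):
    """Request a bearer token and return it as a string.

    post is any callable with the signature of requests.post. It defaults
    to requests.post and is a parameter only to allow a stand-in in tests.
    """
    if post is None:
        import requests  # imported lazily so a stand-in can replace it
        post = requests.post

    # Build the Basic credentials exactly as in the previous section.
    bearer_credentials = '{}:{}'.format(api_key, api_secret_key)
    credentials = base64.b64encode(
        bearer_credentials.encode('utf-8')).decode('utf-8')
    headers = {
        'Authorization': 'Basic {}'.format(credentials),
        'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
    }
    r = post(AUTH_ENDPOINT, headers=headers,
             data='grant_type=client_credentials')

    # Fail with a dedicated exception type the caller can catch.
    if r.status_code != 200:
        raise TwitterAuthError('Bad status code: {}'.format(r.status_code))
    resp = r.json()
    if resp.get('token_type') != 'bearer':
        raise TwitterAuthError(
            'Bad token type: {}'.format(resp.get('token_type')))
    return resp['access_token']
```

With this in place, refreshing the token is a matter of calling `bearer_token = get_bearer_token(API_KEY, API_SECRET_KEY)` again, and the caller can catch `TwitterAuthError` instead of a generic `RuntimeError`.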
Performing a search¶
Now that you have a bearer token, you can start to query the Twitter API. To start, actually go to twitter.com and use the search bar to test your search query. Let’s try searching for ‘$UBER’, the stock ticker symbol for Uber. You should see a variety of tweets related to Uber stock data. Feel free to modify your search term to try to get more interesting results. Once you are confident your search term returns results you’d like, let’s have our application run this search.
import requests

headers = {
    'Authorization': 'Bearer {}'.format(bearer_token),
}
params = {
    'q': '$UBER',
    'count': 3,
    'result_type': 'recent'
}
search_endpoint = 'https://api.twitter.com/1.1/search/tweets.json'
r = requests.get(search_endpoint, params=params, headers=headers)
if r.status_code != 200:
    raise RuntimeError('Bad status code: {}'.format(r.status_code))
tweets = r.json()
The variable tweets will now contain (up to) 3 tweets. Below I’ve listed the first tweet I get when I execute this search.
>>> import json
>>> tweet = tweets['statuses'][0]
>>> print(json.dumps(tweet, indent=4))
{
"place": {
"country": "United States",
"name": "Manhattan",
"id": "01a9a39529b27f36",
"country_code": "US",
"contained_within": [],
"url": "https://api.twitter.com/1.1/geo/id/01a9a39529b27f36.json",
"bounding_box": {
"type": "Polygon",
"coordinates": [
[
[
-74.026675,
40.683935
],
[
-73.910408,
40.683935
],
[
-73.910408,
40.877483
],
[
-74.026675,
40.877483
]
]
]
},
"attributes": {},
"place_type": "city",
"full_name": "Manhattan, NY"
},
"in_reply_to_status_id_str": null,
"lang": "en",
"in_reply_to_user_id": 3294648321,
"entities": {
"hashtags": [],
"user_mentions": [
{
"name": "Gene Shmunster",
"screen_name": "NotaBubble",
"indices": [
0,
11
],
"id": 3294648321,
"id_str": "3294648321"
}
],
"symbols": [
{
"indices": [
37,
42
],
"text": "UBER"
}
],
"urls": []
},
"favorite_count": 0,
"contributors": null,
"in_reply_to_user_id_str": "3294648321",
"retweeted": false,
"is_quote_status": false,
"text": "@NotaBubble 62 Billion valuation on $UBER ? Talk about Bubbles. They need to talk it up so that they can raise even more cash via idiots.",
"retweet_count": 0,
"source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"in_reply_to_status_id": null,
"favorited": false,
"geo": null,
"coordinates": null,
"created_at": "Mon Jan 11 17:01:06 +0000 2016",
"id": 686593687942467585,
"metadata": {
"result_type": "recent",
"iso_language_code": "en"
},
"in_reply_to_screen_name": "NotaBubble",
"id_str": "686593687942467585",
"user": {
"follow_request_sent": null,
"screen_name": "FilmProfessor9",
"listed_count": 27,
"has_extended_profile": false,
"profile_image_url_https": "https://pbs.twimg.com/profile_images/1371916509/zaustin_normal.jpg",
"lang": "en",
"profile_image_url": "http://pbs.twimg.com/profile_images/1371916509/zaustin_normal.jpg",
"name": "G Hawkins",
"utc_offset": null,
"entities": {
"description": {
"urls": []
}
},
"geo_enabled": true,
"profile_banner_url": "https://pbs.twimg.com/profile_banners/260559365/1407938044",
"url": null,
"description": "Former Wall St Bond Trader. Now Teach & Trade my own Money. MIT PhD. Invest in Tech & Growth: $AAPL $FB $SBUX $DIS #PinkFloyd #FFNOW #PeteRose #StarTrek $NFLX",
"profile_sidebar_border_color": "4D044D",
"protected": false,
"followers_count": 373,
"time_zone": null,
"profile_link_color": "709917",
"is_translation_enabled": false,
"profile_background_image_url": "http://pbs.twimg.com/profile_background_images/243623269/barcino_245.JPG",
"verified": false,
"following": null,
"favourites_count": 3990,
"statuses_count": 4747,
"default_profile_image": false,
"profile_background_tile": true,
"default_profile": false,
"profile_sidebar_fill_color": "080508",
"profile_use_background_image": true,
"profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/243623269/barcino_245.JPG",
"friends_count": 864,
"profile_background_color": "030003",
"created_at": "Fri Mar 04 03:24:18 +0000 2011",
"id": 260559365,
"contributors_enabled": false,
"is_translator": false,
"notifications": null,
"location": "New York City ",
"profile_text_color": "FAF0FA",
"id_str": "260559365"
},
"truncated": false
}
There is a huge amount of information here, which is great from a machine learning perspective. Scanning the document, we can see that the text of the tweet can be found with
>>> tweet['text']
'@NotaBubble 62 Billion valuation on $UBER ? Talk about Bubbles. They need to talk it up so that they can raise even more cash via idiots.'
Other interesting fields are ‘retweeted’, ‘in_reply_to_user_id’, ‘place’, and ‘created_at’.
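Since the goal is to store tweets for later analysis, here is a minimal sketch of how a few of these fields could be written to a database. It assumes SQLite via Python's built-in sqlite3 module, which this project does not prescribe; the table layout and the names `init_db` and `store_tweets` are hypothetical. The tweet's `id_str` serves as the primary key, so running the same search twice does not create duplicate rows.

```python
import sqlite3
from datetime import datetime

# Matches timestamps such as 'Mon Jan 11 17:01:06 +0000 2016'.
CREATED_AT_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def init_db(conn):
    """Create the (hypothetical) tweets table if it does not exist yet."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS tweets ('
        ' id_str TEXT PRIMARY KEY,'
        ' created_at TEXT NOT NULL,'
        ' user_id_str TEXT,'
        ' text TEXT NOT NULL)'
    )


def store_tweets(conn, statuses):
    """Insert tweets, skipping any whose id is already stored.

    statuses is the list found under tweets['statuses'].
    Returns the number of newly inserted rows.
    """
    before = conn.total_changes
    for tweet in statuses:
        created = datetime.strptime(tweet['created_at'], CREATED_AT_FORMAT)
        conn.execute(
            'INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)',
            (tweet['id_str'], created.isoformat(),
             tweet['user']['id_str'], tweet['text']),
        )
    conn.commit()
    return conn.total_changes - before
```

A typical call would be `store_tweets(conn, tweets['statuses'])` after `conn = sqlite3.connect('tweets.db')` and `init_db(conn)`, where the file name is again only an example. Storing the creation time in ISO 8601 form keeps the rows sortable as plain text.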
More APIs: Finding users¶
There are many APIs available; let’s explore one more. Suppose we want to learn about the user @NotaBubble, to whom the last tweet we found was a reply. Using the users/show endpoint we can retrieve information about individual users. First, let’s get the user id and screen name of @NotaBubble.
>>> tweet['in_reply_to_user_id']
3294648321
>>> tweet['in_reply_to_screen_name']
'NotaBubble'
With this we can craft a request to get this user’s information.
params = {
    'user_id': tweet['in_reply_to_user_id'],
    'screen_name': tweet['in_reply_to_screen_name']
}
user_endpoint = 'https://api.twitter.com/1.1/users/show.json'
r = requests.get(user_endpoint, params=params, headers=headers)
if r.status_code != 200:
    raise RuntimeError('Bad status code: {}'.format(r.status_code))
user = r.json()
When I run this code, I get the following result.
>>> print(json.dumps(user, indent=4))
{
"created_at": "Sat May 23 01:24:46 +0000 2015",
"profile_text_color": "000000",
"geo_enabled": false,
"has_extended_profile": false,
"profile_sidebar_border_color": "000000",
"profile_background_tile": false,
"entities": {
"url": {
"urls": [
{
"display_url": "leadersinvestmentclub.com/the-team/",
"indices": [
0,
22
],
"url": "http://t.co/ucbPJWbVpw",
"expanded_url": "http://leadersinvestmentclub.com/the-team/"
}
]
},
"description": {
"urls": []
}
},
"verified": false,
"profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
"utc_offset": null,
"profile_banner_url": "https://pbs.twimg.com/profile_banners/3294648321/1432344426",
"profile_use_background_image": false,
"url": "http://t.co/ucbPJWbVpw",
"default_profile_image": false,
"id": 3294648321,
"notifications": null,
"profile_location": null,
"is_translation_enabled": false,
"lang": "en",
"profile_image_url_https": "https://pbs.twimg.com/profile_images/601921960986054658/hUhFx0hW_normal.jpg",
"profile_sidebar_fill_color": "000000",
"time_zone": null,
"contributors_enabled": false,
"profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
"id_str": "3294648321",
"profile_background_color": "000000",
"location": "",
"description": "F+ rated analyst at Pumper Joffrey specializing in Not A Bubble",
"profile_image_url": "http://pbs.twimg.com/profile_images/601921960986054658/hUhFx0hW_normal.jpg",
"protected": false,
"profile_link_color": "3B94D9",
"statuses_count": 1642,
"followers_count": 94,
"favourites_count": 37,
"name": "Gene Shmunster",
"screen_name": "NotaBubble",
"following": null,
"listed_count": 5,
"is_translator": false,
"status": {
"created_at": "Mon Jan 11 17:51:13 +0000 2016",
"coordinates": null,
"in_reply_to_status_id_str": "686593687942467585",
"retweeted": false,
"favorite_count": 0,
"id_str": "686606303578333184",
"entities": {
"user_mentions": [
{
"indices": [
0,
15
],
"name": "G Hawkins",
"id": 260559365,
"screen_name": "FilmProfessor9",
"id_str": "260559365"
}
],
"urls": [],
"symbols": [],
"hashtags": []
},
"geo": null,
"text": "@FilmProfessor9 glad to see Gopro climbed some during my lunch... sell off was getting ridiculous. Now at $15.4 again. Yeesh!",
"in_reply_to_user_id_str": "260559365",
"lang": "en",
"source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"in_reply_to_screen_name": "FilmProfessor9",
"contributors": null,
"in_reply_to_status_id": 686593687942467585,
"place": null,
"id": 686606303578333184,
"retweet_count": 0,
"in_reply_to_user_id": 260559365,
"truncated": false,
"favorited": false
},
"follow_request_sent": null,
"friends_count": 226,
"default_profile": false
}
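Both requests above only check for a 200 status code. To obey Twitter's rate limits, the program also needs to decide how long to pause when its request allowance runs out. The sketch below assumes the API reports this through the x-rate-limit-remaining and x-rate-limit-reset response headers and signals an exceeded limit with status 429; the function name `seconds_to_wait` and the `fallback` default are made up for illustration.

```python
import time


def seconds_to_wait(status_code, headers, now=None, fallback=60):
    """Return how many seconds to pause before the next request.

    headers is the mapping of response headers (r.headers in requests,
    which is case-insensitive; a plain dict must use lower-case keys).
    Returns 0 when the current window still has requests left.
    """
    now = time.time() if now is None else now
    remaining = headers.get('x-rate-limit-remaining')
    reset = headers.get('x-rate-limit-reset')  # epoch seconds, as a string

    exhausted = status_code == 429 or remaining == '0'
    if not exhausted:
        return 0
    if reset is None:
        # No reset time reported: fall back to a fixed (assumed) pause.
        return fallback
    return max(0, int(reset) - now)
```

After each call, something like `time.sleep(seconds_to_wait(r.status_code, r.headers))` would keep the program within its allowance without hard-coding the size of the rate-limit window.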