This is the code I wrote to scrape tweets from Twitter with Python through the Twitter Search (REST) API. I designed it to search Twitter for various keywords tweeted by specific users and return the results in a table. Introducing this program into my workflow has saved me around 30 minutes per day by narrowing down the number of tweets I have to process manually by roughly 98% without sacrificing accuracy. I have since shared it with coworkers who have benefited from it as well. Let’s get started…
Step 1: Setting Parameters
The first thing we need to do is create some functions that will collect the search parameters that we want and format them to be used with the API.
import pandas as pd
from datetime import date, datetime
import sys
import os
import base64
import requests
import re
import json
def get_usernames():
    while True:
        try:
            usernames = input("Enter the Twitter usernames you'd like to track, separated by a comma (example: @TxDOTDallas, @TxDOTFortWorth): ")
            if usernames[0] != '@':
                raise ValueError  # this will send it to the print message and back to the input option
            else:
                username_list = usernames.replace(" ", "").replace("@", "from:@").replace(",", " OR ")
                return username_list
        except ValueError:
            print("Hmmm... " + usernames + " doesn't look like a valid list of usernames. Let's try that again.")
def get_keywords():
    while True:
        try:
            keywords = input("Enter the keywords you'd like to track, separated by a comma (example: ramp, closure): ")
            keywords = keywords.replace(", ", ",").split(",")
            for keyword in keywords:
                if " " in keyword:
                    raise ValueError  # this will send it to the print message and back to the input option
            keywords = " OR ".join(keywords)
            return keywords
        except ValueError:
            print("Hmmm...keywords can only be single words, not phrases. Let's try that again.")
def get_blacklist():
    while True:
        try:
            blacklist = input("Enter the words you'd like to avoid in your search, separated by a comma (example: traffic, wreck): ")
            blacklist = blacklist.replace(", ", ",").split(",")
            for word in blacklist:
                if " " in word:
                    raise ValueError  # this will send it to the print message and back to the input option
            blacklist = " -".join(blacklist)
            blacklist = "-" + blacklist
            return blacklist
        except ValueError:
            print("Hmmm...blacklisted words can only be single words, not phrases. Let's try that again.")
This next function will take the usernames that we want to search and break them up into groups of roughly 10. We need to do this because Twitter's API only lets us search for tweets from around 10 specific usernames at a time, so we will be calling the API several times in quick succession depending on how many usernames we need to scrape. I needed to scrape around 70, so I programmed it up to that amount, but you can always add more groups if needed.
def format_usernames():
    # Format usernames for search
    usernames = user['usernames'][0]
    usernames = usernames.replace(" ", "").split("OR")
    # Organize first set of usernames
    usernames_one = usernames[0]
    for item in usernames[1:11]:
        usernames_one += ' OR ' + item
    # Organize second set if > 11 usernames
    try:
        usernames_two = usernames[11]
        for item in usernames[12:23]:
            usernames_two += ' OR ' + item
    except IndexError:
        usernames_two = []
    # Organize third set if > 23 usernames
    try:
        usernames_three = usernames[23]
        for item in usernames[24:35]:
            usernames_three += ' OR ' + item
    except IndexError:
        usernames_three = []
    # Organize fourth set if > 35 usernames
    try:
        usernames_four = usernames[35]
        for item in usernames[36:47]:
            usernames_four += ' OR ' + item
    except IndexError:
        usernames_four = []
    # Organize fifth set if > 47 usernames
    try:
        usernames_five = usernames[47]
        for item in usernames[48:59]:
            usernames_five += ' OR ' + item
    except IndexError:
        usernames_five = []
    # Organize sixth set if > 59 usernames
    try:
        usernames_six = usernames[59]
        for item in usernames[60:71]:
            usernames_six += ' OR ' + item
    except IndexError:
        usernames_six = []
    return usernames_one, usernames_two, usernames_three, usernames_four, usernames_five, usernames_six
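If you'd rather not hard-code six groups, here is a more general sketch (my own refactor, not part of the original script) that chunks any number of usernames; it assumes the same user['usernames'][0] settings string used above:
def format_usernames_generic(group_size=11):
    # Split the stored username string back into individual "from:@..." terms
    names = user['usernames'][0].replace(" ", "").split("OR")
    # Re-join them into OR-separated groups of at most group_size usernames
    return [" OR ".join(names[i:i + group_size]) for i in range(0, len(names), group_size)]
Each element of the returned list could then be passed to the API call in a loop instead of handling usernames_one through usernames_six individually.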
Step 2: Check for an Existing Twitter Settings File
This next block of code checks whether we already have search settings saved; if not, it collects the search params from our first set of functions above. It then stores these params for future reference.
These search parameters can be adjusted anytime in the csv file created in the directory.
try:
    user = pd.read_csv("twitter_settings.csv", index_col=0)
    keywords = user['keywords'][0]
    blacklist = user['blacklist'][0]
    usernames_one, usernames_two, usernames_three, usernames_four, usernames_five, usernames_six = format_usernames()
except IOError:
    print("Let's get you set up!")
    # Get scrape settings
    usernames = get_usernames()
    keywords = get_keywords()
    blacklist = get_blacklist()
    date = date.today().strftime('%Y-%m-%d')
    filename = os.path.abspath("twitter_settings.csv")
    # Create user dataframe
    col = {"usernames": usernames, "keywords": keywords, "blacklist": blacklist}
    user = pd.DataFrame(col, index=[date])
    user.to_csv("twitter_settings.csv", encoding='utf-8', index=True)
    usernames_one, usernames_two, usernames_three, usernames_four, usernames_five, usernames_six = format_usernames()
    print("Your twitter settings are all set up! You can edit these search settings directly through the file named 'twitter_settings.csv'. You can find this file at " + filename)
Step 3: Collecting API Credentials
In order to scrape Twitter we first need to quickly create a Twitter API app at https://apps.twitter.com/app/new. You may need to get approval, which should take only a few minutes, but in some cases it could take a few days. Once your app is approved you should be able to easily find your client_key and client_secret, which we will need in a second.
This next block of code collects these credentials from the user and then encodes them into a format that is URL friendly (i.e. no special characters, etc.). To be honest, I'm not very knowledgeable about this part of the process, but according to the API documentation this is how it's done.
try:
    f = open("credentials.txt", "r")
    contents = f.read()
    f.close()
    client_key = contents.replace("=", ",").split(",")[1]
    client_secret = contents.replace("=", ",").split(",")[3]
except IOError:
    f = open("credentials.txt", "w+")
    client_key = input('Client Key: ')
    client_secret = input('Client Secret: ')
    f.write("client_key={}, client_secret={}".format(client_key, client_secret))
    f.close()
# Reformat the keys and encode them
key_secret = '{}:{}'.format(client_key, client_secret).encode('ascii')
# Base64-encode the key:secret pair (the result is still bytes)
b64_encoded_key = base64.b64encode(key_secret)
# Decode from bytes back into a string
b64_encoded_key = b64_encoded_key.decode('ascii')
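If you're curious what this encoding produces, here is a quick illustration with made-up credentials (never paste real keys into throwaway examples):
# Illustrative only -- 'abc' and 'xyz' are made-up credentials
demo_key = base64.b64encode('abc:xyz'.encode('ascii')).decode('ascii')
print(demo_key)  # the base64 string that would go into the Basic auth header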
We can now create the authorization URL with our encoded keys and combine that with the base URL. This allows us to acquire our access (bearer) token, which we will add to the final URL request.
# Adding authorization to base URL
base_url = 'https://api.twitter.com/'
auth_url = '{}oauth2/token'.format(base_url)
auth_headers = {
    'Authorization': 'Basic {}'.format(b64_encoded_key),
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'}
auth_data = {
    'grant_type': 'client_credentials'}
auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)
# Creating bearer token
access_token = auth_resp.json()['access_token']
search_headers = {
    'Authorization': 'Bearer {}'.format(access_token)}
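It's worth checking that the token request actually succeeded before moving on. A minimal check (the error message wording is my own) might look like this:
if auth_resp.status_code != 200:
    # A non-200 status usually means the key/secret pair is wrong or the app isn't approved yet
    raise RuntimeError("Token request failed ({}): {}".format(auth_resp.status_code, auth_resp.text))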
Step 4: Calling the API With Our Parameters
This next function takes in one of the groups of accounts that we want to scrape, searches Twitter for them along with the keywords and blacklisted words that we specified earlier, and returns the data as a DataFrame.
def Call_api(username_group):
    # Adding search parameters: ('OR' = or)('+' = and)('-' = not) *(space interpreted as &)
    search_params = {'q': username_group + " + " + keywords + " + " + blacklist}
    # Adding Twitter search API to base URL
    search_url = '{}1.1/search/tweets.json'.format(base_url)
    # Putting all elements together and executing the request
    resp = requests.get(search_url, headers=search_headers, params=search_params)
    # Parse the JSON data from the request
    scrape = json.loads(resp.content)
    # Extracting the desired data from the scrape as lists
    users = []
    tweet_dates = []
    kw_tweets = []
    for tweet in scrape['statuses']:
        users.append(tweet['user']['screen_name'])
        kw_tweets.append(tweet['text'])
        dates = datetime.strptime(tweet['created_at'][4:-11], '%b %d %H:%M:%S').strftime('%m/%d %H:%M')
        tweet_dates.append(dates)
    # Creating column names and adding their data
    scrape = {
        "User": users,
        "Date": tweet_dates,
        "Tweet": kw_tweets,
    }
    # Create DataFrame
    df = pd.DataFrame(scrape)
    return df
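As a quick sanity check, you can call the function on the first group and peek at the result (assuming the authentication step above succeeded):
sample = Call_api(usernames_one)
print(sample.head())  # first few scraped tweets as User / Date / Tweet columns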
How This Would Look Manually
In the Call_api function above we added search params that were already formatted for the API, but here is how these params would be formatted if we wanted to input them directly.
To include a Twitter account in the search, add ‘from:@user_name’ just as I have done below. We separate these usernames, as well as the following keywords, with ‘OR’. Between our usernames and keywords we add a ‘+’ to indicate ‘and’. Note: any space between words is interpreted as a ‘+’ (i.e. and).
# Adding search parameters: ('OR' = or)('+' = and)('-' = not) *(space interpreted as &)
search_params = {
    'q': 'from:@DriveTSG OR from:@DFWConnector OR from:@TxDOTDallas OR from:@TxDOTLufkin OR from:@TXDOTWF\
 OR from:@TxDOTParis OR from:@TxDOTFortWorth OR from:@635east OR from:@TxDOTBryan OR from:@TXDOTWF\
 OR from:@TxDOTTyler OR from:@TxDOTAtlanta\
 + shift OR ramp OR switch OR months OR shift OR widening OR closed OR closure OR detour OR long-term OR complete\
 + -various -alternating -inside -single -vehicle -left -right -far-left -far-right -wreck',
}
# Adding Twitter search API to base URL
search_url = '{}1.1/search/tweets.json'.format(base_url)
# Putting all elements together and executing the request
search_resp = requests.get(search_url, headers=search_headers, params=search_params)
# Check if the request was successful. '200' means success
print(search_resp)
Let’s look at a quick example:
‘from:@TxDOTParis OR from:@TxDOTDallas + shift OR ramp + -single -wreck’
The query above gives us any tweets from either of the two accounts which contain either the keyword ‘shift’ or ‘ramp’ or both, and don’t contain the word ‘single’ and don’t contain the word ‘wreck’.
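If you want to experiment with a query like this directly, you can build the string yourself and reuse the same request pattern (the accounts and keywords here are just the ones from the example above):
manual_query = 'from:@TxDOTParis OR from:@TxDOTDallas + shift OR ramp + -single -wreck'
resp = requests.get(search_url, headers=search_headers, params={'q': manual_query})
print(resp.status_code, len(resp.json().get('statuses', [])))  # status code and number of tweets returned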
Step 5: Call the API for Our Group of Accounts
In this next block of code we scrape each group of accounts that we created earlier. As long as a list of usernames isn't empty, we scrape its tweets and add them to the DataFrame containing the scraped data from the previous groups.
#Scraping Group One
DataFrame = Call_api(usernames_one)
if usernames_two != []:
df = Call_api(usernames_two)
DataFrame = pd.concat([DataFrame,df], ignore_index=True)
else:
pass
if usernames_three != []:
df = Call_api(usernames_three)
DataFrame = pd.concat([DataFrame,df], ignore_index=True)
else:
pass
if usernames_four != []:
df = Call_api(usernames_four)
DataFrame = pd.concat([DataFrame,df], ignore_index=True)
else:
pass
if usernames_five != []:
df = Call_api(usernames_five)
DataFrame = pd.concat([DataFrame,df], ignore_index=True)
else:
pass
if usernames_six != []:
df = Call_api(usernames_six)
DataFrame = pd.concat([DataFrame,df], ignore_index=True)
else:
pass
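If you prefer less repetition, an equivalent, more compact version of the block above (my own refactor) would be:
DataFrame = Call_api(usernames_one)
for group in (usernames_two, usernames_three, usernames_four, usernames_five, usernames_six):
    if group != []:
        DataFrame = pd.concat([DataFrame, Call_api(group)], ignore_index=True)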
Inside Call_api we ran a for loop to save these data points as lists. Depending on what info you need, you may have to explore the raw response a bit more. You can do this by printing the parsed JSON inside the function, e.g. print(scrape['statuses']).
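For example, to see which fields Twitter returns for each tweet, you could temporarily add something like this inside Call_api, right after the json.loads call (purely for exploration):
# Peek at the fields available on a single tweet
if scrape['statuses']:
    print(sorted(scrape['statuses'][0].keys()))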
Step 6: Removing Duplicate Tweets
Many organizations automate their Twitter accounts and may tweet out the same alert or notification every day for a period of time. To avoid keeping duplicate tweets, we first temporarily strip the attached URLs, which contain unique slugs that keep otherwise identical tweets from being detected as duplicates. Because we only use the stripped text to decide which rows to keep, the original tweets (URLs included) remain in the table.
It would also be nice to have our tweets in chronological order, so below we re-sort the rows by the ‘Date’ column and reset the index.
# Remove any duplicate tweets (ignoring the unique t.co slugs) and keep the original rows
tweets = DataFrame.loc[DataFrame['Tweet'].replace(r'(https://t.co)/(\w+)', r'\1', regex=True).drop_duplicates().index]
# Sort the tweets by date
tweets = tweets.sort_values(by=['Date'], ascending=False)
tweets.reset_index(drop=True, inplace=True)
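To make the URL-stripping step concrete, here is what that regex does to a single (made-up) tweet text:
# Illustrative only -- the tweet text and t.co slug are made up
text = pd.Series(["Ramp closure on I-35E this weekend https://t.co/abc123XYZ"])
print(text.replace(r'(https://t.co)/(\w+)', r'\1', regex=True)[0])
# -> "Ramp closure on I-35E this weekend https://t.co"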
Step 7: Export to CSV or Excel
The last step is to export the DataFrame to either an Excel or CSV file. You can choose either one of the blocks of code below depending on which file type you want. This will create a file in the same directory that you are running the program in.
#Export data to CSV file
tweets.to_csv("kw_tweets.csv", encoding='utf-8', index=False)
# Export table to Excel without index column
writer = pd.ExcelWriter("kw_tweets.xlsx", engine='xlsxwriter')
tweets.to_excel(writer, index=False)
writer.save()
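If you don't need the ExcelWriter object, a one-line export also works, provided an Excel engine such as xlsxwriter or openpyxl is installed (the filename here is just an example):
tweets.to_excel("kw_tweets_alt.xlsx", index=False)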
Output
After running the code you should have a CSV or Excel file in the same directory that you ran the code from. Below is a screenshot of the CSV file that resulted from my parameters.