This is the code I wrote to scrape tweets from Twitter with Python through the Twitter Search (REST) API. I designed it to search Twitter for various keywords tweeted by specific users and return the results in a table. Introducing this program into my workflow has saved me around 30 minutes per day by narrowing down the number of tweets I have to process manually by roughly 98% without sacrificing accuracy. I have since shared it with coworkers who have benefited from it as well. Let’s get started…
Step 1: Setting Parameters
The first thing we need to do is create some functions that will collect the search parameters that we want and format them to be used with the API.
import pandas as pd
from datetime import date, datetime
import sys
import os
import base64
import requests
import re
import json
def get_usernames():
    while True:
        try:
            usernames = input("Enter the Twitter usernames you'd like to track, separated by a comma (example: @TxDOTDallas, @TxDOTFortWorth): ")
            if usernames[0] != '@':
                raise ValueError  # this will send it to the print message and back to the input option
            else:
                username_list = usernames.replace(" ", "").replace("@", "from:@").replace(",", " OR ")
                return username_list
        except ValueError:
            print("Hmmm... " + usernames + " doesn't look like a valid list of usernames. Let's try that again.")
def get_keywords():
    while True:
        try:
            keywords = input("Enter the keywords you'd like to track, separated by a comma (example: ramp, closure): ")
            keywords = keywords.replace(", ", ",").split(",")
            for keyword in keywords:
                if " " in keyword:
                    raise ValueError  # this will send it to the print message and back to the input option
            keywords = " OR ".join(keywords)
            return keywords
        except ValueError:
            print("Hmmm...keywords can only be single words, not phrases. Let's try that again.")
def get_blacklist():
    while True:
        try:
            blacklist = input("Enter the words you'd like to avoid in your search, separated by a comma (example: traffic, wreck): ")
            blacklist = blacklist.replace(", ", ",").split(",")
            for word in blacklist:
                if " " in word:
                    raise ValueError  # this will send it to the print message and back to the input option
            blacklist = " -".join(blacklist)
            blacklist = "-" + blacklist
            return blacklist
        except ValueError:
            print("Hmmm...blacklisted words can only be single words, not phrases. Let's try that again.")
This next function will take the usernames that we want to search and break them up into groups of roughly 10. We need to do this because Twitter's API only lets us search for tweets from around 10 specific usernames at a time, so we will be calling the API several times in quick succession depending on how many usernames we need to scrape. I needed to scrape around 70, so I programmed it up to that amount, but you can always add more groups if needed.
def format_usernames():
    # Format usernames for search
    usernames = user['usernames'][0]
    usernames = usernames.replace(" ", "").split("OR")
    # Organize first set of usernames
    usernames_one = usernames[0]
    for item in usernames[1:11]:
        usernames_one += ' OR ' + item
    # Organize second set if > 11 usernames
    try:
        usernames_two = usernames[11]
        for item in usernames[12:23]:
            usernames_two += ' OR ' + item
    except IndexError:
        usernames_two = []
    # Organize third set if > 23 usernames
    try:
        usernames_three = usernames[23]
        for item in usernames[24:35]:
            usernames_three += ' OR ' + item
    except IndexError:
        usernames_three = []
    # Organize fourth set if > 35 usernames
    try:
        usernames_four = usernames[35]
        for item in usernames[36:47]:
            usernames_four += ' OR ' + item
    except IndexError:
        usernames_four = []
    # Organize fifth set if > 47 usernames
    try:
        usernames_five = usernames[47]
        for item in usernames[48:59]:
            usernames_five += ' OR ' + item
    except IndexError:
        usernames_five = []
    # Organize sixth set if > 59 usernames
    try:
        usernames_six = usernames[59]
        for item in usernames[60:71]:
            usernames_six += ' OR ' + item
    except IndexError:
        usernames_six = []
    return usernames_one, usernames_two, usernames_three, usernames_four, usernames_five, usernames_six
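If you'd rather not hard-code six groups, here is a more general sketch (my own refactor, not part of the original script) that chunks any number of usernames; it assumes the same user['usernames'][0] settings string used above:
def format_usernames_generic(group_size=11):
    # Split the stored username string back into individual "from:@..." terms
    names = user['usernames'][0].replace(" ", "").split("OR")
    # Re-join them into OR-separated groups of at most group_size usernames
    return [" OR ".join(names[i:i + group_size]) for i in range(0, len(names), group_size)]
Each element of the returned list could then be passed to the API call in a loop instead of handling usernames_one through usernames_six individually.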
Step 2: Check for an Existing Twitter Settings File
This next block of code checks whether we already have search settings saved; if not, it collects the search params from our first set of functions above. It then stores these params for future reference.
These search parameters can be adjusted anytime in the csv file created in the directory.
try:
    user = pd.read_csv("twitter_settings.csv", index_col=0)
    keywords = user['keywords'][0]
    blacklist = user['blacklist'][0]
    usernames_one, usernames_two, usernames_three, usernames_four, usernames_five, usernames_six = format_usernames()
except IOError:
    print("Let's get you set up!")
    # Get scrape settings
    usernames = get_usernames()
    keywords = get_keywords()
    blacklist = get_blacklist()
    date = date.today().strftime('%Y-%m-%d')
    filename = os.path.abspath("twitter_settings.csv")
    # Create user dataframe
    col = {"usernames": usernames, "keywords": keywords, "blacklist": blacklist}
    user = pd.DataFrame(col, index=[date])
    user.to_csv("twitter_settings.csv", encoding='utf-8', index=True)
    usernames_one, usernames_two, usernames_three, usernames_four, usernames_five, usernames_six = format_usernames()
    print("Your twitter settings are all set up! You can edit these search settings directly through the file named 'twitter_settings.csv'. You can find this file at " + filename)
Step 3: Collecting API Credentials
In order to scrape Twitter we first need to quickly create a Twitter API app at https://apps.twitter.com/app/new. You may need to get approval, which should take only a few minutes, but in some cases it could take a few days. Once your app is approved you should be able to easily find your client_key and client_secret, which we will need in a second.
This next block of code collects these credentials from the user and then encodes them into a format that is URL friendly (i.e. no special characters, etc.). To be honest, I'm not very knowledgeable about this part of the process, but according to the API documentation this is how it's done.
try:
    f = open("credentials.txt", "r")
    contents = f.read()
    f.close()
    client_key = contents.replace("=", ",").split(",")[1]
    client_secret = contents.replace("=", ",").split(",")[3]
except IOError:
    f = open("credentials.txt", "w+")
    client_key = input('Client Key: ')
    client_secret = input('Client Secret: ')
    f.write("client_key={}, client_secret={}".format(client_key, client_secret))
    f.close()
# Reformat the keys and encode them
key_secret = '{}:{}'.format(client_key, client_secret).encode('ascii')
# Base64-encode the key:secret pair (the result is still bytes)
b64_encoded_key = base64.b64encode(key_secret)
# Decode from bytes back into a string
b64_encoded_key = b64_encoded_key.decode('ascii')
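If you're curious what this encoding produces, here is a quick illustration with made-up credentials (never paste real keys into throwaway examples):
# Illustrative only -- 'abc' and 'xyz' are made-up credentials
demo_key = base64.b64encode('abc:xyz'.encode('ascii')).decode('ascii')
print(demo_key)  # the base64 string that would go into the Basic auth header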
We can now create the authorization URL with our encoded keys and combine that with the base URL. This allows us to acquire our access (bearer) token, which we will add to the final URL request.
# Adding authorization to base URL
base_url = 'https://api.twitter.com/'
auth_url = '{}oauth2/token'.format(base_url)
auth_headers = {
    'Authorization': 'Basic {}'.format(b64_encoded_key),
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'}
auth_data = {
    'grant_type': 'client_credentials'}
auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)
# Creating bearer token
access_token = auth_resp.json()['access_token']
search_headers = {
    'Authorization': 'Bearer {}'.format(access_token)}
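It's worth checking that the token request actually succeeded before moving on. A minimal check (the error message wording is my own) might look like this:
if auth_resp.status_code != 200:
    # A non-200 status usually means the key/secret pair is wrong or the app isn't approved yet
    raise RuntimeError("Token request failed ({}): {}".format(auth_resp.status_code, auth_resp.text))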
Step 4: Calling the API With Our Parameters
This next function takes in one of the groups of accounts that we want to scrape, searches Twitter for them along with the keywords and blacklisted words that we specified earlier, and returns the data as a DataFrame.
def Call_api(username_group):
    # Adding search parameters: ('OR' = or)('+' = and)('-' = not) *(space interpreted as &)
    search_params = {'q': username_group + " + " + keywords + " + " + blacklist}
    # Adding Twitter search API to base URL
    search_url = '{}1.1/search/tweets.json'.format(base_url)
    # Putting all elements together and executing the request
    resp = requests.get(search_url, headers=search_headers, params=search_params)
    # Parse the JSON data from the request
    scrape = json.loads(resp.content)
    # Extracting the desired data from the scrape as lists
    users = []
    tweet_dates = []
    kw_tweets = []
    for tweet in scrape['statuses']:
        users.append(tweet['user']['screen_name'])
        kw_tweets.append(tweet['text'])
        dates = datetime.strptime(tweet['created_at'][4:-11], '%b %d %H:%M:%S').strftime('%m/%d %H:%M')
        tweet_dates.append(dates)
    # Creating column names and adding their data
    scrape = {
        "User": users,
        "Date": tweet_dates,
        "Tweet": kw_tweets,
    }
    # Create DataFrame
    df = pd.DataFrame(scrape)
    return df
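As a quick sanity check, you can call the function on the first group and peek at the result (assuming the authentication step above succeeded):
sample = Call_api(usernames_one)
print(sample.head())  # first few scraped tweets as User / Date / Tweet columns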
How This Would Look Manually
In the Call_api function above we added search params that were already formatted for the API, but here is how these params would be formatted if we wanted to input them directly.
To include a Twitter account in the search, add ‘from:@user_name’ just as I have done below. We separate these usernames, as well as the following keywords, with ‘OR’. Between our usernames and keywords we add a ‘+’ to indicate ‘and’. Note: any space between words is interpreted as a ‘+’ (i.e. and).
# Adding search parameters: ('OR' = or)('+' = and)('-' = not) *(space interpreted as &)
search_params = {
    'q': 'from:@DriveTSG OR from:@DFWConnector OR from:@TxDOTDallas OR from:@TxDOTLufkin OR from:@TXDOTWF\
 OR from:@TxDOTParis OR from:@TxDOTFortWorth OR from:@635east OR from:@TxDOTBryan OR from:@TXDOTWF\
 OR from:@TxDOTTyler OR from:@TxDOTAtlanta\
 + shift OR ramp OR switch OR months OR shift OR widening OR closed OR closure OR detour OR long-term OR complete\
 + -various -alternating -inside -single -vehicle -left -right -far-left -far-right -wreck',
}
# Adding Twitter search API to base URL
search_url = '{}1.1/search/tweets.json'.format(base_url)
# Putting all elements together and executing the request
search_resp = requests.get(search_url, headers=search_headers, params=search_params)
# Check if the request was successful. '200' means success
print(search_resp)
Let’s look at a quick example:
‘from:@TxDOTParis OR from:@TxDOTDallas + shift OR ramp + -single -wreck’
The query above gives us any tweets from either of the two accounts which contain either the keyword ‘shift’ or ‘ramp’ or both, and don’t contain the word ‘single’ and don’t contain the word ‘wreck’.
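If you want to experiment with a query like this directly, you can build the string yourself and reuse the same request pattern (the accounts and keywords here are just the ones from the example above):
manual_query = 'from:@TxDOTParis OR from:@TxDOTDallas + shift OR ramp + -single -wreck'
resp = requests.get(search_url, headers=search_headers, params={'q': manual_query})
print(resp.status_code, len(resp.json().get('statuses', [])))  # status code and number of tweets returned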
Step 5: Call the API for Our Group of Accounts
In this next block of code we scrape each group of accounts that we created earlier. As long as a list of usernames isn't empty, we scrape its tweets and add them to the DataFrame containing the scraped data from the previous groups.
#Scraping Group One
DataFrame = Call_api(usernames_one)
if usernames_two != []:
df = Call_api(usernames_two)
DataFrame = pd.concat([DataFrame,df], ignore_index=True)
else:
pass
if usernames_three != []:
df = Call_api(usernames_three)
DataFrame = pd.concat([DataFrame,df], ignore_index=True)
else:
pass
if usernames_four != []:
df = Call_api(usernames_four)
DataFrame = pd.concat([DataFrame,df], ignore_index=True)
else:
pass
if usernames_five != []:
df = Call_api(usernames_five)
DataFrame = pd.concat([DataFrame,df], ignore_index=True)
else:
pass
if usernames_six != []:
df = Call_api(usernames_six)
DataFrame = pd.concat([DataFrame,df], ignore_index=True)
else:
pass
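If you prefer less repetition, an equivalent, more compact version of the block above (my own refactor) would be:
DataFrame = Call_api(usernames_one)
for group in (usernames_two, usernames_three, usernames_four, usernames_five, usernames_six):
    if group != []:
        DataFrame = pd.concat([DataFrame, Call_api(group)], ignore_index=True)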
Inside Call_api we ran a for loop to save these data points as lists. Depending on what info you need, you may have to explore the raw response a bit more. You can do this by printing the parsed JSON inside the function, e.g. print(scrape['statuses']).
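For example, to see which fields Twitter returns for each tweet, you could temporarily add something like this inside Call_api, right after the json.loads call (purely for exploration):
# Peek at the fields available on a single tweet
if scrape['statuses']:
    print(sorted(scrape['statuses'][0].keys()))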
Step 6: Removing Duplicate Tweets
Many organizations automate their Twitter accounts and may tweet out the same alert or notification every day for a period of time. To avoid keeping duplicate tweets, we first temporarily strip the attached URLs, which contain unique slugs that keep otherwise identical tweets from being detected as duplicates. Because we only use the stripped text to decide which rows to keep, the original tweets (URLs included) remain in the table.
It would also be nice to have our tweets in chronological order, so below we re-sort the rows by the ‘Date’ column and reset the index.
# Remove any duplicate tweets (ignoring the unique t.co slugs) and keep the original rows
tweets = DataFrame.loc[DataFrame['Tweet'].replace(r'(https://t.co)/(\w+)', r'\1', regex=True).drop_duplicates().index]
# Sort the tweets by date
tweets = tweets.sort_values(by=['Date'], ascending=False)
tweets.reset_index(drop=True, inplace=True)
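To make the URL-stripping step concrete, here is what that regex does to a single (made-up) tweet text:
# Illustrative only -- the tweet text and t.co slug are made up
text = pd.Series(["Ramp closure on I-35E this weekend https://t.co/abc123XYZ"])
print(text.replace(r'(https://t.co)/(\w+)', r'\1', regex=True)[0])
# -> "Ramp closure on I-35E this weekend https://t.co"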
Step 7: Export to CSV or Excel
The last step is to export the DataFrame to either an Excel or CSV file. You can choose either one of the blocks of code below depending on which file type you want. This will create a file in the same directory that you are running the program in.
#Export data to CSV file
tweets.to_csv("kw_tweets.csv", encoding='utf-8', index=False)
# Export table to Excel without index column
writer = pd.ExcelWriter("kw_tweets.xlsx", engine='xlsxwriter')
tweets.to_excel(writer, index=False)
writer.save()
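If you don't need the ExcelWriter object, a one-line export also works, provided an Excel engine such as xlsxwriter or openpyxl is installed (the filename here is just an example):
tweets.to_excel("kw_tweets_alt.xlsx", index=False)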
Output
After running the code you should have a CSV or Excel file in the same directory that you ran the code from. Below is a screenshot of the CSV file that resulted from my parameters.