This is the code I wrote to scrape tweets from Twitter with Python through the Twitter Search (REST) API. I designed it to search Twitter for various keywords tweeted by specific users and return the results in a table. Introducing this program into my workflow has saved me around 30 minutes per day by narrowing down the number of tweets I have to process manually by roughly 98%, without sacrificing accuracy. I have since shared it with coworkers who have benefited from it as well. Well, let’s get started…

Step 1: Setting Parameters

The first thing we need to do is create some functions that will collect the search parameters that we want and format them to be used with the API.
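Here is a minimal sketch of what these collection functions might look like. The function names and prompts (get_usernames, get_keywords, get_blacklist) are my own illustrations, not the original code:

```python
def get_usernames():
    """Prompt for the Twitter usernames to search and return them as a list."""
    raw = input('Enter usernames separated by commas: ')
    return [name.strip().lstrip('@') for name in raw.split(',') if name.strip()]

def get_keywords():
    """Prompt for the keywords to search for and return them as a list."""
    raw = input('Enter keywords separated by commas: ')
    return [word.strip() for word in raw.split(',') if word.strip()]

def get_blacklist():
    """Prompt for words that should exclude a tweet and return them as a list."""
    raw = input('Enter blacklisted words separated by commas: ')
    return [word.strip() for word in raw.split(',') if word.strip()]
```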

This next function will take the usernames that we want to search and break them up into lists of 10. We need to do this because Twitter’s API only lets us search for tweets from around 10 specific usernames at a time. So we will call the API several times in quick succession, depending on how many usernames we need to scrape. I needed to scrape around 70, so I programmed it up to this amount, but you can always add more if needed.
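A simple way to do this (a sketch, with an illustrative function name) is a chunking helper, which also removes any hard-coded limit on how many usernames you can handle:

```python
def group_usernames(usernames, group_size=10):
    """Split the username list into chunks of at most `group_size`,
    since the search API only handles roughly 10 'from:' operators per query."""
    return [usernames[i:i + group_size] for i in range(0, len(usernames), group_size)]

# e.g. 70 usernames -> 7 groups of 10
groups = group_usernames(get_usernames())
```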

Step 2: Check for Existing Twitter Setting File

This next block of code will look to see if we already have search settings saved, and if not, it will collect the search params from our first set of functions above. It then stores these params for future reference.
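A sketch of that check, assuming the Step 1 helpers above and an illustrative settings filename:

```python
import os
import pandas as pd

SETTINGS_FILE = 'twitter_settings.csv'  # illustrative filename

if os.path.isfile(SETTINGS_FILE):
    # Reuse the saved search parameters
    settings = pd.read_csv(SETTINGS_FILE)
    usernames = settings['usernames'].dropna().tolist()
    keywords = settings['keywords'].dropna().tolist()
    blacklist = settings['blacklist'].dropna().tolist()
else:
    # Collect the parameters with the Step 1 functions and save them for next time
    usernames, keywords, blacklist = get_usernames(), get_keywords(), get_blacklist()
    settings = pd.DataFrame(dict(usernames=pd.Series(usernames),
                                 keywords=pd.Series(keywords),
                                 blacklist=pd.Series(blacklist)))
    settings.to_csv(SETTINGS_FILE, index=False)
```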

These search parameters can be adjusted anytime in the csv file created in the directory.

Step 3: Collecting API Credentials

In order to scrape Twitter, we first need to quickly create a Twitter API app at https://apps.twitter.com/app/new. You may need to get approval, which should take only a few minutes, but in some cases it could take a few days. Once your app is approved, you should be able to easily find your client_key and client_secret, which we will need in a sec.

This next block of code collects these credentials from the user and then encodes them into a format that is URL friendly (i.e. no special characters, etc.). To be honest, I’m not very knowledgeable about this part of the process, but according to the API documentation this is how it’s done.
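Per Twitter’s application-only authentication docs, the encoding step URL-encodes each credential, joins the pair with a colon, and base64-encodes the result. A sketch:

```python
import base64
import urllib.parse

client_key = input('Enter your client key: ')
client_secret = input('Enter your client secret: ')

# URL-encode each credential, join with ':', then base64-encode the pair
key_secret = '{}:{}'.format(urllib.parse.quote(client_key),
                            urllib.parse.quote(client_secret))
b64_encoded_key = base64.b64encode(key_secret.encode('ascii')).decode('ascii')
```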

We can now create the authorization URL with our encoded keys and combine that with the base URL. This allows us to acquire our access (bearer) token, which we will add to the final URL request.
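The token request itself is a POST to the documented oauth2/token endpoint, using the encoded key from the previous step:

```python
import requests

base_url = 'https://api.twitter.com/'
auth_url = base_url + 'oauth2/token'

auth_headers = {
    'Authorization': 'Basic {}'.format(b64_encoded_key),
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
}
auth_data = {'grant_type': 'client_credentials'}

auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)
access_token = auth_resp.json()['access_token']  # the bearer token
```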

Step 4: Calling the API With Our Parameters

This next function will take in our list of accounts that we want to scrape, along with the keywords and blacklisted words that we specified earlier, scrape Twitter, and return the data as a DataFrame.
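Here is a minimal sketch of what that function might look like. The endpoint and parameters are the standard v1.1 search API, but the function name call_api and the columns I keep are my own choices:

```python
import requests
import pandas as pd

def call_api(usernames, keywords, blacklist, access_token):
    """Search recent tweets from `usernames` matching `keywords`,
    excluding `blacklist` words; return the results as a DataFrame."""
    # Build the query; spaces act as 'and' once URL-encoded (see the next section)
    query = ' OR '.join('from:@' + name for name in usernames)
    query += ' ' + ' OR '.join(keywords)
    query += ' ' + ' '.join('-' + word for word in blacklist)

    search_url = 'https://api.twitter.com/1.1/search/tweets.json'
    search_headers = {'Authorization': 'Bearer {}'.format(access_token)}
    search_params = {'q': query, 'result_type': 'recent', 'count': 100}

    data = requests.get(search_url, headers=search_headers,
                        params=search_params).json()
    rows = [{'date': s['created_at'],
             'user': s['user']['screen_name'],
             'text': s['text']}
            for s in data.get('statuses', [])]
    return pd.DataFrame(rows)
```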

How This Would Look Manually

In the call API function above we added search params that were already formatted for the API, but here is how these params would be formatted if we wanted to input them directly.

To include a Twitter account in the search, add ‘from:@user_name’ just as I have done below. We separate these usernames, as well as the following keywords, with ‘OR’. Between our usernames and keywords we add a ‘+’ to indicate ‘and’. Note: any space between words is interpreted as a ‘+’ (i.e. ‘and’) once the query is URL-encoded.

Let’s look at a quick example:

‘from:@TxDOTParis OR from:@TxDOTDallas + shift OR ramp + -single -wreck’

The query above gives us any tweets from either of the two accounts that contain the keyword ‘shift’ or ‘ramp’ (or both), and don’t contain the word ‘single’ or the word ‘wreck’.
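You can see the space-to-‘+’ behavior by URL-encoding the example query yourself:

```python
import urllib.parse

query = 'from:@TxDOTParis OR from:@TxDOTDallas + shift OR ramp + -single -wreck'
print(urllib.parse.quote_plus(query))
# from%3A%40TxDOTParis+OR+from%3A%40TxDOTDallas+%2B+shift+OR+ramp+%2B+-single+-wreck
```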

Step 5: Call the API for Our Group of Accounts

In this next block of code we scrape each group of accounts that we created earlier. As long as a list of usernames isn’t empty, we scrape its tweets and append them to the data frame containing the scraped data from the previous groups.
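A sketch of that loop, reusing the group_usernames and call_api sketches from the earlier steps:

```python
import pandas as pd

all_tweets = pd.DataFrame()
for group in groups:           # the username groups from Step 1
    if group:                  # skip any empty list
        group_df = call_api(group, keywords, blacklist, access_token)
        all_tweets = pd.concat([all_tweets, group_df], ignore_index=True)
```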

We then run a for loop to save these data points as variables. Depending on what info you need, you may have to explore the data a bit more. You can do this with the code: print(Data['statuses'])
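For example, assuming `data` holds the raw JSON from a search call, you can inspect which fields each status offers before deciding what to keep:

```python
# Each status is a dict with many documented fields
print(data['statuses'][0].keys())
# e.g. dict_keys(['created_at', 'id', 'id_str', 'text', 'user', 'entities', ...])
```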

Step 6: Removing Duplicate Tweets

Many organizations automate their Twitter accounts and thus may tweet out the same alert or notification every day for a period of time. To avoid getting duplicate tweets, we first need to temporarily remove the attached URLs, which contain unique slugs that keep us from detecting tweets that are otherwise identical. We can then add back the URLs once duplicates are removed.
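One way to do this, assuming the all_tweets DataFrame and column names from my earlier sketches: strip the URLs into a helper column, dedupe on the URL-free text, then drop the helper so the original text (URLs included) remains.

```python
# Remove URLs temporarily so otherwise-identical tweets match
all_tweets['text_no_url'] = all_tweets['text'].str.replace(
    r'https?://\S+', '', regex=True)
all_tweets = all_tweets.drop_duplicates(subset=['user', 'text_no_url'])
all_tweets = all_tweets.drop(columns='text_no_url')
```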

It would be nice to have our tweets in chronological order, so we will do this below by re-sorting the rows by the ‘date’ column and resetting the index.
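Continuing the sketch, this is a two-liner in pandas:

```python
import pandas as pd

all_tweets['date'] = pd.to_datetime(all_tweets['date'])
all_tweets = all_tweets.sort_values('date').reset_index(drop=True)
```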

Step 7: Export to CSV or Excel

The last step is to export the DataFrame to either an Excel or csv file. You can choose either one of the blocks of code below depending on what file type you want. This will create a file in the same directory that you are running the program in.
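Assuming the all_tweets DataFrame from the sketches above (filenames are illustrative):

```python
# Excel (needs the openpyxl package):
all_tweets.to_excel('scraped_tweets.xlsx', index=False)

# ...or csv:
all_tweets.to_csv('scraped_tweets.csv', index=False)
```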

Output

After running the code you should have a csv or excel file in the same directory that you ran the code in. Below is a screenshot of the csv file that resulted from my parameters.

csv file created by scraping tweets containing keywords with python

End Result

You can get the full code on my GitHub.