This is a program that I built to scrape post data from LinkedIn pages. It currently extracts the date, media type, text, like count, comment count, share count, and any media links. Let's get into it.
Required Python Packages and Tools
As far as I know, the data you can get from LinkedIn's API is limited, so we'll be scraping data straight from the page's source code. To accomplish this we'll use several tools, which I break down below (there's an install command after the list):
- Selenium: This tool works in conjunction with ChromeDriver to perform our desired actions, like clicking links and scrolling. If you've never used Selenium, it can be fun to watch it run because it looks as though someone else is in control of the screen.
- WebDriver: This tool is the middleman between Selenium and Google Chrome, which allows everything to run smoothly. You used to have to download the specific WebDriver for your version of Chrome, but now the versioning is handled automatically.
- Beautiful Soup: This is a Python package that will allow us to find and access the various LinkedIn elements we want to collect. It scours the page's source code for all of the tags we instruct it to find.
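If you don't already have these installed, a single pip command should cover everything used below (openpyxl is only needed for the Excel export at the end):
pip install selenium beautifulsoup4 pandas python-dateutil openpyxl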
Now Onto the Code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as bs
import time
import pandas as pd
import re
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Initialize Chrome options
chrome_options = Options()
# Set up date formatting for today's date
today = datetime.today().strftime('%Y-%m-%d')
To get started you just need to add your LinkedIn credentials, along with the page that you want to scrape. It's usually a good idea to store credentials externally or as environment variables, but I kept it simple here to avoid overcomplicating things (a quick environment-variable sketch follows the next code block).
#LinkedIn Credentials
username=""
password=""
#Url of the Linkedin Page you want to scrape
page = 'https://www.linkedin.com/company/nike'
if page[-1] == "/":
    company_name = page.split("/")[-2]
else:
    company_name = page.split("/")[-1]
company_name = company_name.replace('-',' ').title()
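If you'd rather not hard-code your credentials, a minimal environment-variable version might look like this (the variable names LINKEDIN_USERNAME and LINKEDIN_PASSWORD are just my own choice, not anything LinkedIn requires):
import os

# Read credentials from environment variables instead of hard-coding them
username = os.environ.get("LINKEDIN_USERNAME", "")
password = os.environ.get("LINKEDIN_PASSWORD", "")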
We now need Selenium to use ChromeDriver to open Chrome and visit LinkedIn, where it will sign in using your login info. You can skip this and just log in manually if you find that easier. In general there's a trade-off between the degree of automation and the amount of debugging you'll need to do as LinkedIn changes its elements over time.
#Access the WebDriver
browser = webdriver.Chrome()
#Open login page
browser.get('https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin')
#Enter login info:
elementID = browser.find_element(By.ID, "username")
elementID.send_keys(username)
elementID = browser.find_element(By.ID, "password")
elementID.send_keys(password)
elementID.submit()
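One optional addition: LinkedIn sometimes takes a moment to redirect after login, or throws up a verification prompt. A short explicit wait after submitting gives the page (or you) time to finish. This is just a sketch using Selenium's WebDriverWait, and it assumes a successful login lands you on a URL containing "feed":
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 60 seconds for the post-login redirect
# (this also leaves time to complete a verification prompt manually if one appears)
WebDriverWait(browser, 60).until(lambda d: "feed" in d.current_url)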
We next navigate to the LinkedIn page that we want to scrape — whatever URL you set in the page variable earlier. I'm into backpacking and the outdoors, so REI would be a fun example. We'll go straight to the posts section of the page via the URL.
#Go to webpage
post_page = page + '/posts'
post_page = post_page.replace('//posts','/posts')
browser.get(post_page)
Now, this next section of code lets us scroll down the entire LinkedIn page. This is important because, with an infinite-scroll page like LinkedIn, you would only get the first few posts without this step.
Wait, what’s Infinite Scroll?
This basically means that there isn't a series of pages containing, say, 50 posts each, but rather a single page that is as long as it needs to be to hold all of the content.
SCROLL_PAUSE_TIME = 1.5
# Set MAX_SCROLLS to an integer to cap scrolling; False means keep scrolling until the page stops growing
MAX_SCROLLS = False
# JavaScript commands
SCROLL_COMMAND = "window.scrollTo(0, document.body.scrollHeight);"
GET_SCROLL_HEIGHT_COMMAND = "return document.body.scrollHeight"
# Initial scroll height
last_height = browser.execute_script(GET_SCROLL_HEIGHT_COMMAND)
scrolls = 0
no_change_count = 0
while True:
    # Scroll down to the bottom of the page
    browser.execute_script(SCROLL_COMMAND)
    # Wait for new content to load
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = browser.execute_script(GET_SCROLL_HEIGHT_COMMAND)
    # Increment the no-change counter, or reset it if the page grew
    no_change_count = no_change_count + 1 if new_height == last_height else 0
    # Break if the scroll height has not changed for 3 cycles or we reached the maximum scrolls
    if no_change_count >= 3 or (MAX_SCROLLS and scrolls >= MAX_SCROLLS):
        break
    last_height = new_height
    scrolls += 1
Enter Beautiful Soup…
We will now collect the page source and then start looking for the page elements that we want to access.
company_page = browser.page_source
linkedin_soup = bs(company_page.encode("utf-8"), "html.parser")
#print(linkedin_soup.prettify())
containers = linkedin_soup.find_all("div",{"class":"feed-shared-update-v2"})
containers = [container for container in containers if 'activity' in container.get('data-urn', '')]
print(len(containers))
#Saving the container HTML for debugging purposes
containers_text = "\n\n".join([c.prettify() for c in containers])
with open(f"{company_name}_soup_containers.txt", "w+", encoding="utf-8") as t:
    t.write(containers_text)
How Do I Find the Elements I Want?
This is probably the trickiest step of the process and may require some trial and error. The best way to go about it is to visit the page that you would like to scrape and start exploring the tags that identify the various elements you want to access.
You can do this by right-clicking the desired element and clicking "Inspect Element". This opens the browser's developer tools, showing the page's HTML.
Using this tool you can find all of the elements and sub-elements that you need. I've already collected everything we'll use for our purposes, but if you want to grab additional elements or do something similar on another site, this is an important step.
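Once you've spotted a candidate class name in the inspector, it's worth sanity-checking it against one of the saved containers before wiring it into the main loop. A quick sketch (it assumes at least one container was found; the class name here is the post-text wrapper used later on):
# Try the class name against the first post container we collected
sample = containers[0]
text_div = sample.find("div", {"class": "feed-shared-update-v2__description-wrapper"})
print(text_div.text.strip() if text_div else "Not found - try another class name")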
Let's create some helper functions for various tasks.
LinkedIn displays post dates as "time since posted" rather than the actual date, so this first function, get_actual_date, takes in a time period like "5 weeks" and determines the approximate date the post was published.
def get_actual_date(date):
    today = datetime.today().strftime('%Y-%m-%d')
    current_year = datetime.today().strftime('%Y')

    def get_past_date(days=0, weeks=0, months=0, years=0):
        date_format = '%Y-%m-%d'
        dtObj = datetime.strptime(today, date_format)
        past_date = dtObj - relativedelta(days=days, weeks=weeks, months=months, years=years)
        past_date_str = past_date.strftime(date_format)
        return past_date_str

    past_date = date
    if 'hour' in date:
        past_date = today
    elif 'day' in date:
        past_date = get_past_date(days=int(date.split(" ")[0]))
    elif 'week' in date:
        past_date = get_past_date(weeks=int(date.split(" ")[0]))
    elif 'month' in date:
        past_date = get_past_date(months=int(date.split(" ")[0]))
    elif 'year' in date:
        past_date = get_past_date(years=int(date.split(" ")[0]))
    else:
        # Fall back to handling dates already given as "M-D" or "M-D-YYYY"
        split_date = date.split("-")
        if len(split_date) == 2:
            past_month = split_date[0]
            past_day = split_date[1]
            if len(past_month) < 2:
                past_month = "0" + past_month
            if len(past_day) < 2:
                past_day = "0" + past_day
            past_date = f"{current_year}-{past_month}-{past_day}"
        elif len(split_date) == 3:
            past_month = split_date[0]
            past_day = split_date[1]
            past_year = split_date[2]
            if len(past_month) < 2:
                past_month = "0" + past_month
            if len(past_day) < 2:
                past_day = "0" + past_day
            past_date = f"{past_year}-{past_month}-{past_day}"
    return past_date
def convert_abbreviated_to_number(s):
    # Counts may come through as ints (e.g. the default 0) or contain commas, so normalize to a clean string first
    s = str(s).replace(',', '')
    if 'K' in s:
        return int(float(s.replace('K', '')) * 1000)
    elif 'M' in s:
        return int(float(s.replace('M', '')) * 1000000)
    else:
        return int(s)
def get_text(container, selector, attributes):
    try:
        element = container.find(selector, attributes)
        if element:
            return element.text.strip()
    except Exception as e:
        print(e)
    return ""
Handling Sticking Points With Trial & Error
As you will discover at some point, this process is rarely straightforward and requires some testing to overcome the issues you might run into.
For instance, I ran into issues when collecting the video view count because the tag for this element overlapped with the tags for the other social-count elements. Here's an example of the overlap (one workaround is sketched after the list):
- View Count: “social-details-social-counts__item“
- Comment Count: “social-details-social-counts__reactions social-details-social-counts__item“
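One way around this kind of overlap is to match the element's class list exactly rather than by individual class name. This is just a sketch, using the first container collected above; the class strings come from the inspector and may well have changed by the time you read this:
# Exact-match the class list so elements carrying extra classes
# (like the reactions counter) are excluded
view_el = containers[0].find(
    lambda tag: tag.get("class") == ["social-details-social-counts__item"]
)
view_count = view_el.text.strip() if view_el else "0"
print(view_count)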
Handling Different Post Types
Different pages may have slight differences that you'll have to play around with. For instance, some companies host their videos on YouTube while others may use Facebook. This results in slightly different tags and available data. The get_media_info function in the next block of code handles these differences by checking whether any of the post-type indicators appears in the post container.
# Function to extract the media type and media link from a post container
def get_media_info(container):
    media_info = [("div", {"class": "update-components-video"}, "Video"),
                  ("div", {"class": "update-components-linkedin-video"}, "Video"),
                  ("div", {"class": "update-components-image"}, "Image"),
                  ("article", {"class": "update-components-article"}, "Article"),
                  ("div", {"class": "feed-shared-external-video__meta"}, "Youtube Video"),
                  ("div", {"class": "feed-shared-mini-update-v2 feed-shared-update-v2__update-content-wrapper artdeco-card"}, "Shared Post"),
                  ("div", {"class": "feed-shared-poll ember-view"}, "Other: Poll, Shared Post, etc")]
    for selector, attrs, media_type in media_info:
        element = container.find(selector, attrs)
        if element:
            link = element.find('a', href=True)
            return link['href'] if link else "None", media_type
    return "None", "Unknown"
Looping Through the Posts
We will now loop through every post on the page, using the element identifiers we found in the previous step to collect the data we want and add it to our posts_data list. This is the part of the code most likely to need updating by the time you read this, but it will usually be a slight change in class names or something similar.
posts_data = []

# Main loop to process each post container
for container in containers:
    post_text = get_text(container, "div", {"class": "feed-shared-update-v2__description-wrapper"})
    media_link, media_type = get_media_info(container)
    post_date = get_text(container, "div", {"class": "ml4 mt2 text-body-xsmall t-black--light"})
    post_date = get_actual_date(post_date)

    # Reactions (likes)
    reactions_element = container.find_all(lambda tag: tag.name == 'button' and 'aria-label' in tag.attrs and 'reaction' in tag['aria-label'].lower())
    reactions_idx = 1 if len(reactions_element) > 1 else 0
    post_reactions = reactions_element[reactions_idx].text.strip() if reactions_element and reactions_element[reactions_idx].text.strip() != '' else 0

    # Comments
    comment_element = container.find_all(lambda tag: tag.name == 'button' and 'aria-label' in tag.attrs and 'comment' in tag['aria-label'].lower())
    comment_idx = 1 if len(comment_element) > 1 else 0
    post_comments = comment_element[comment_idx].text.strip() if comment_element and comment_element[comment_idx].text.strip() != '' else 0

    # Shares
    shares_element = container.find_all(lambda tag: tag.name == 'button' and 'aria-label' in tag.attrs and 'repost' in tag['aria-label'].lower())
    shares_idx = 1 if len(shares_element) > 1 else 0
    post_shares = shares_element[shares_idx].text.strip() if shares_element and shares_element[shares_idx].text.strip() != '' else 0

    # Append the collected data to the posts_data list
    posts_data.append({
        "Page": company_name,
        "Date": post_date,
        "Post Text": post_text,
        "Media Type": media_type,
        "Likes": post_reactions,
        "Comments": post_comments,
        "Shares": post_shares,
        "Likes Numeric": convert_abbreviated_to_number(post_reactions),
        "Media Link": media_link
    })
Lastly, we take our posts_data list, which contains a dictionary for each post, create a pandas DataFrame from it, and export it as a CSV and/or Excel file.
try:
    df = pd.DataFrame(posts_data)
except Exception as e:
    # If the DataFrame can't be built, print the error and each post's keys for debugging
    print(e)
    for post in posts_data:
        print(list(post.keys()), len(post))

# Convert any numeric-looking columns to integers
for col in df.columns:
    try:
        df[col] = df[col].astype(int)
    except Exception:
        pass

df.sort_values(by="Likes Numeric", inplace=True, ascending=False)
df.to_csv("{}_posts.csv".format(company_name), encoding='utf-8', index=False)
df.to_excel("{}_linkedin_posts.xlsx".format(company_name), index=False)
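If everything ran cleanly, a quick peek at the top of the DataFrame is an easy sanity check before opening the exported files:
# Preview the most-liked posts (the frame is already sorted by "Likes Numeric")
print(df.head())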