This is a program that I built to scrape post data from LinkedIn pages. It currently extracts the date, media type, text, like count, comment count, share count, and any media links. Let’s get into it.

Required Python Packages and Tools

As far as I know, the data you can get from LinkedIn’s API is limited, so we’ll be scraping data straight from the page’s source code. To accomplish this we’ll use several tools, which I’ll break down below (the installs and imports follow the list):

  • Selenium: This tool works in conjunction with ChromeDriver to perform actions like clicking links and scrolling. If you’ve never used Selenium, it can be fun to watch it run because it appears as though someone else is in control of the screen.
  • WebDriver: This tool is the middleman between Selenium and Google Chrome, which allows everything to run smoothly. You used to have to download the specific driver for your version of Chrome, but the versioning is now all handled automatically.
  • Beautiful Soup: This is a Python package that lets us find and access the various LinkedIn elements we want to collect. It will scour the page’s source code for all of the tags we point it at.
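
Here’s a minimal setup sketch, assuming the standard PyPI package names and a recent Selenium version; adjust to your environment.

```python
# Install the dependencies once from the command line:
#   pip install selenium beautifulsoup4 pandas

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
```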

Now Onto the Code

To get started, you just need to add your LinkedIn credentials, along with the page that you want to scrape. It’s usually a good idea to store credentials externally or as environment variables, but I kept it simple here to avoid overcomplicating things.
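
Something like the following, where the values are placeholders you’d swap for your own:

```python
# Placeholders only; in a real project, prefer os.environ or a .env file.
LINKEDIN_USERNAME = "you@example.com"
LINKEDIN_PASSWORD = "your-password"
COMPANY = "rei"  # the company page we'll scrape
```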

We now need Selenium to use ChromeDriver to open Chrome and visit LinkedIn, where it will sign in with your login info. You can skip this step and just log in manually if you find that easier. In general there’s a tradeoff between the degree of automation and the amount of debugging you’ll need to do as LinkedIn changes its elements over time.
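
A sketch of the login step follows. The field IDs reflect LinkedIn’s login page at the time of writing and may have changed; recent versions of Selenium fetch the matching ChromeDriver automatically.

```python
driver = webdriver.Chrome()  # Selenium 4.6+ manages the driver binary itself
driver.get("https://www.linkedin.com/login")

# Fill in the credentials and submit; these element IDs may change over time.
driver.find_element(By.ID, "username").send_keys(LINKEDIN_USERNAME)
driver.find_element(By.ID, "password").send_keys(LINKEDIN_PASSWORD)
driver.find_element(By.XPATH, "//button[@type='submit']").click()
time.sleep(3)  # give the post-login redirect a moment to settle
```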

Next we navigate to the LinkedIn page that we want to scrape. I’m into backpacking and the outdoors, so I’ll use REI as an example. We’ll go straight to the posts section of the page via the URL.
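
Assuming the usual company-page URL structure, that looks like:

```python
# Jump straight to the posts feed of the company page.
driver.get(f"https://www.linkedin.com/company/{COMPANY}/posts/")
time.sleep(3)  # let the first batch of posts render
```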

Now, this next section of code lets us scroll down the entire LinkedIn page. This is important because with an infinite scroll page like LinkedIn’s, you would only get the first few posts without this step.
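
A common pattern, and roughly what I use here, is to scroll to the bottom in a loop until the page height stops growing:

```python
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom, then wait for the next batch of posts to load.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the page stopped growing, so we've reached the last post
    last_height = new_height
```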

Wait, what’s Infinite Scroll?

This means that instead of a series of pages each containing, say, 50 posts, there is a single page that loads more content as you scroll, growing as long as it needs to be to hold everything.

Enter Beautiful Soup…

We will now collect the source code and then start looking for the page elements that we want to access.
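
A sketch of that step, where the container class name is a placeholder for whatever Inspect Element shows on the live page:

```python
# Parse the fully scrolled page and grab each post's outer container.
soup = BeautifulSoup(driver.page_source, "html.parser")
containers = soup.find_all("div", {"class": "occludable-update"})  # placeholder class
```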

How Do I Find the Elements That I Want?

This is probably the trickiest step of the process and may require some trial and error. The best way to go about it is to open the page you’d like to scrape and start exploring the tags that identify the various elements you want to access.

You can do this by right-clicking the desired element and clicking “Inspect Element”. This will open a window like in the example below.

Using this tool you can find all of the elements and sub-elements that you need. I’ve already collected everything required for our purposes, but if you’d like to grab additional elements or perform a similar task on another site, this is an important step.

Let’s create some helper functions for various tasks

LinkedIn displays post dates as ‘time since posted’ rather than the actual date, so this first function, get_actual_date, takes in the time period (like ‘5 weeks’) and determines the approximate date the post was published.
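
A minimal sketch, assuming the scraped string looks like ‘5 weeks’ or ‘3 days’:

```python
from datetime import datetime, timedelta

def get_actual_date(time_since_posted):
    """Turn a relative timestamp like '5 weeks' into an approximate date."""
    number, unit = time_since_posted.split()[:2]
    days_per_unit = {"day": 1, "week": 7, "month": 30, "year": 365}
    # Posts that are minutes or hours old are effectively from today.
    offset = int(number) * days_per_unit.get(unit.rstrip("s"), 0)
    return (datetime.today() - timedelta(days=offset)).date()
```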

Handling Sticking Points With Trial & Error

As you will discover at some point, this process is rarely straightforward and requires some testing to overcome the issues you might run into.

For instance, I ran into issues when collecting the video view count because the tag for this element overlapped with the tags for other social-count elements. Here’s an example of the overlap below; note the shared class, with the fix sketched after the list.

  • View Count: “social-details-social-counts__item”
  • Comment Count: “social-details-social-counts__reactions social-details-social-counts__item”
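
The catch is that BeautifulSoup’s class filter matches any element that merely contains a given class, so searching for the view-count class also returns the comment counter. One way around it is to compare the element’s full class list exactly:

```python
# class_="..." matches elements that *contain* that class, so it would also
# catch the comment counter above. Comparing the whole class list exactly
# avoids the overlap.
view_items = soup.find_all(
    lambda tag: tag.get("class") == ["social-details-social-counts__item"]
)
```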

Handling Different Post Types

Different pages may have slight differences that you’ll have to play around with. For instance, some companies host their videos on YouTube while others use Facebook, which results in slightly different tags and available data. The series of nested try/except blocks in the next block of code handles these differences by checking whether one of the post-type indicators appears in the post container.
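
A sketch of that idea, with placeholder tag and class names standing in for whatever Inspect Element shows on the live page:

```python
def get_media_info(container):
    """Return (media_type, media_link) for a post container."""
    try:  # native video post
        return "Video", container.find("video")["src"]
    except (AttributeError, TypeError, KeyError):
        pass
    try:  # image post; the class name here is a placeholder
        image = container.find("div", {"class": "update-components-image"})
        return "Image", image.find("img")["src"]
    except (AttributeError, TypeError, KeyError):
        pass
    try:  # externally hosted media, e.g. a YouTube or Facebook embed
        return "External", container.find("a", {"class": "app-aware-link"})["href"]
    except (AttributeError, TypeError, KeyError):
        pass
    return "Text", None  # plain text post with no media
```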

Looping Through the Posts

We will now loop through every post on the page, using the element identifiers we found in the previous step to collect the data we want and append it to our post_data list. This is the part of the code most likely to need updating by the time you read this, but it will usually be a slight change in class names or something similar.
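
The loop looks roughly like this; every selector is illustrative and should be checked against the live page:

```python
post_data = []
for container in containers:
    try:
        text = container.find("span", {"class": "break-words"}).get_text(strip=True)
    except AttributeError:
        text = None
    try:
        raw_date = container.find("span", {"class": "visually-hidden"}).get_text(strip=True)
        date = get_actual_date(raw_date)
    except (AttributeError, ValueError):
        date = None
    media_type, media_link = get_media_info(container)
    # The like/comment/share counts follow the same find-and-parse pattern.
    post_data.append({
        "date": date,
        "text": text,
        "media_type": media_type,
        "media_link": media_link,
    })
```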

Lastly, we take our post_data list, which contains a dictionary per post, create a pandas DataFrame from it, and export it as a CSV and/or Excel file.
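
A sketch of the export; the output filenames are my own choices, and Excel export needs the openpyxl package installed.

```python
df = pd.DataFrame(post_data)
df.to_csv("linkedin_posts.csv", index=False)
df.to_excel("linkedin_posts.xlsx", index=False)  # requires openpyxl
```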

Our Table of LinkedIn Post Data


You can get the complete code on my GitHub.