How I Scrape and Analyse Twitter Networks [Case Study]

This ‘how-to’ report is intended to assist those conducting research on information operations or performing a network analysis on Twitter. It may also be useful to those seeking to hunt bots, track automation, capture breaking events or learn how to scrape data.

Because many have asked about this from varying pockets of the world, I have made sure all of the resources in this analysis are free.

I have broken this research down into three sections. They mirror how I conduct a network analysis: a preliminary research stage, getting the data, and presenting the data.

  • Research preparation
    • Past information operation case studies
    • Identification of a trend
    • Monitoring hashtags in Tweetdeck
    • What data do you need for a network visualisation?
  • Essential raw data
    • Tools to capture data from Twitter
    • Capturing data using Python
    • Converting captured data to graph
    • Location-captured tweets to graph
    • Capturing data using Twitter API
  • How to cook data and present it nicely in Gephi
    • How I display network visualisations
    • How I organise nodes (accounts) into clusters
    • Analysis in Gephi’s Data Laboratory
    • Running a brief analysis of accounts
      • Botometer
      • TwitterAudit
      • Image reverse search
  • Concluding remarks

For the purpose of this report, I chose to use a dataset I worked on in 2019 on the #bolivianohaygolpe network. The data captured was based on a prominent hashtag used in the context of a coup in Bolivia at the time.

If you would like to follow along with this ‘how-to’ report using your own data, please do.

Research preparation

Before we start scouring the entire internet looking for big networks, we need to know what we’re looking for, and that’s what I have covered in this section.

Simply put, we need to ‘know thy enemy’. We can do that by reviewing past case studies. These are valuable because we can follow someone else’s experience down the rabbit hole of a network analysis and understand how we might tackle similar operations in the future.

A common struggle for researchers is the time it takes to monitor networks on Twitter, so in this section I have also given an insight into my setup for monitoring Twitter activity.

Past Information Operation Case Studies

As mentioned above, researching case studies that have been transparent about their research method can lead to a practical insight into how to identify information operations in the future.

Two case studies I have published on this are:

Both of those investigations originated from the capture of a network and the analysis of the resulting dataset.

Subsequent investigations sought to identify, where possible, the origins of the information campaign and those behind it.

Identification of a trend

So how do we find suspected networks?

The answer: become familiar with the platform and how these networks operate by looking at the data of past takedowns.

Learning about past networks gives you a general knowledge of information operations, such as their aim, appearance, and more importantly, where you might find them.

There are a number of resources online for getting into the data of past information operations on Twitter.

Below are some:

But what about that magic question: “how do we find them?”. Well, the first step is paying attention to the world of Twitter and what’s being said.

One way I tackle that problem is by monitoring hashtags in Tweetdeck.

Monitoring hashtags in Tweetdeck

Monitoring hashtags is important for detecting automation, or a coordinated effort either to shift the narrative on a subject or to boost a specific topic.

You might be wondering why I keep mentioning automation. An automated Twitter account isn’t always bad; there are actually some pretty fun automated accounts on Twitter.

But when there’s an army of fake accounts automating a political issue, community sensitivity, or attacking a company, then there’s smoke and fire.

For monitoring specific hashtags on Twitter, I like to use the monitoring functionality of Tweetdeck.

Simply add a hashtag using the ‘search’ style of column, as seen below.

This will give you a column which you can customise with content settings to choose specific languages, set a date range, show only tweets with images, or filter out retweets.

Not all networks use hashtags. Some might instead use a specific string of words or letters in their content, so you could use that as your monitoring list.

Anything you can identify that is unique to a subject you want to monitor is a starting point to pivot into data collection for analysis.

In the example in the image above, I was monitoring the tag #bolivianohaygolpe, which loosely translates to ‘there is no coup in Bolivia’. There was an abundance of tweets using this same tag, so it was the common link I could use, with the tools I had available, to monitor the trend.

Any common link you can identify is a great starting point to look at what data you want to collect, or scrape. For example, the propaganda operation below was captured using only three tags.

Finding this common link has been the key starting point for most of my investigations, and often the hardest part; once it is found, a sequence of logical steps follows to identify the possible network.

What data do you need for a network visualisation?

Before we get into the details of exactly how to capture data from Twitter for network visualisations and analysis, we first need to identify what we require to make a network visualisation.

Take the following example illustration I have made below from a small cluster of five accounts in Gephi.

In the cluster above we have Twitter accounts identified by their handles @a, @b, @c, @d and @e. In a network, these are called nodes.

Connecting them are the links, referred to in a network as edges. In this case, the edges point ‘outwards’: Twitter account @a tweeted and mentioned @b, @c, @d and @e.

On a spreadsheet, the edges table (the one that holds the connections) looks like the sketch below.
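
As a minimal sketch, using Gephi’s default Source and Target column names, the edges for the cluster above would be:

Source,Target
a,b
a,c
a,d
a,e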

If you wanted to see those labels in Gephi, you would also need a nodes table (individual account information) to label the nodes. In Gephi, this would look like the following.
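
A matching nodes table, as a minimal sketch (Id ties back to the edges table, and Label is what Gephi displays), would be:

Id,Label
a,@a
b,@b
c,@c
d,@d
e,@e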

Please note, this is a very simple description of a small network of five accounts. There are books, courses and far more in-depth resources on network theory and analysis.

Also, much larger networks will not just be lists of account names. For a more in-depth analysis you will need attribute columns such as account creation date, weight and hashtags used, all of which are covered further down in this report.

But what this smaller network does give us is the essence of what a network looks like in CSV (comma-separated values) format when read as a spreadsheet table.

The columns that hold the data you capture will define the links made between rows of a sheet.

So now that we know what necessary data we need to get from a potential network, let’s get into how we can scrape it.

Getting the essential raw data

In this section I am going to cover tools I find essential to capture data.

It covers the following two alternative ways I capture that data:

  1. Twitter scraping with Python; and
  2. Using Twitter’s API to capture data

This section will also cover how to sort raw data into a file that is friendly for processing in a visualisation platform (like Gephi).

Tools to capture data from Twitter

These are two tools I use to capture data from Twitter.

They are:

  1. Python script Twint (in conjunction with Table2Net to sort data)
  2. Gephi with TwitterStreamingImporter and Twitter API

Each of these is covered in the following sections.

Capturing data using Python

Python is a general-purpose programming language. There is an abundance of Python programs capable of collecting data from platforms; for these use cases, however, I use Twint.

Twint is described as: “An advanced Twitter scraping & OSINT tool written in Python that doesn’t use Twitter’s API, allowing you to scrape a user’s followers, following, Tweets and more while evading most API limitations.”

It is made by Francesco Poldi and is an easy-to-use command-line tool for collecting data from Twitter.

Basically, what this does is remove the reliance upon buttons for using Twitter. Instead, we get to glimpse into the realms of user-generated data.

The GitHub page of Twint shows some of the basic commands that can be used in the script (see below). These are basic commands, and there is freedom to mix requests.

For the case study, the tag I was looking into was #bolivianohaygolpe – so a simple command in Twint allows me to pull all of the tweets that used the hashtag.

This is the command I entered:

twint -s bolivianohaygolpe -o bolivianohaygolpe.csv --csv

This automates the collection of all of those tweets (seen below) into a CSV.
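
Twint can also be driven from a Python script rather than the shell. A minimal sketch equivalent to the command above, using Twint’s documented Config object:

import twint

# Equivalent of: twint -s bolivianohaygolpe -o bolivianohaygolpe.csv --csv
c = twint.Config()
c.Search = "bolivianohaygolpe"      # the -s search term
c.Output = "bolivianohaygolpe.csv"  # the -o output file
c.Store_csv = True                  # the --csv flag
twint.run.Search(c)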

The same process can also be used to scrape mentions of a specific account.

For the #bolivianohaygolpe campaign, many of the tweets mentioned the former President of Bolivia, Evo Morales. This was either through replies to his tweets or through mentions of his Twitter handle @evoespueblo.

I used the following command to collect all mentions of his Twitter handle:

twint -s evoespueblo -o evoesmentions.csv --csv

This builds a CSV sheet of all of those mentions and captures the following data seen below.

For further analysis of Evo Morales’ followers, we can use a command such as:

twint -u evoespueblo --followers --user-full -o evoesfoll.csv --csv

This provides full follower information: name, handle, verification status, account creation date, follower and following counts, bio, location and more.

But be warned, it is a much slower process than the previous command, so you might want to make a cup of tea while you’re waiting.

The --csv flag again exports this into an easy-to-view spreadsheet for analysis.

Using Excel’s column filters, we can sort the data by values. I always like to check account creation dates, as a massive batch of accounts made on one day is indicative of odd behaviour.

This method of capturing and analysing the data is useful for identifying trends such as generated account names, shared account creation dates and automated posting times, which can all follow a strong pattern if automation is present. A scripted version of the creation-date check is sketched below.
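
As a minimal sketch with pandas, assuming the creation-date column in your Twint export is named join_date (check the headers of your own CSV):

import pandas as pd

df = pd.read_csv("evoesfoll.csv")
# Count how many accounts were created on each date; a large batch
# appearing on a single day is a classic automation flag.
counts = df["join_date"].value_counts()
print(counts.head(10))  # the ten busiest creation dates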

You’re probably wondering “what else can I do with this data?” – so let’s look at how we can visualise it.

Converting captured data to graph

It is much easier to capture data from Twitter in a network format using the visualisation tool Gephi; more of this will be explained in the later sections of this report.

But it is also possible to convert scraped data into a format friendly file for network visualisations.

This can be done by using the tool Table 2 Net.

Twitter user @hpiedcoq introduced me to this tool as a way to ‘sort’ data into network-friendly lists, and it is quite reliable for that use.

For the purpose of showing how this works, I’m going to use the CSV file I made when I scraped the mentions of Evo Morales using the tool Twint, seen in the previous section.

First, when you upload the CSV it will sort the data into columns.

I like to display this as a network based on citations.

For the settings, I set:

  • My nodes as account names (comma-separated)
  • Links as mentions, since that is what I requested in the scrape
  • Then build the network

Of course, you can set attributes so that once you have your visualisation you can identify different patterns in your nodes.

Once you have built that network, it will provide you with a GEXF file which you can upload straight to Gephi – or another visualiser.
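
If you want to sanity-check the GEXF before loading it into Gephi, networkx can parse the format. A minimal sketch (the filename is an assumption; use whatever Table 2 Net gave you):

import networkx as nx

g = nx.read_gexf("evoesmentions.gexf")
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
# The most-connected handles give a first look at the network's hubs
top = sorted(g.degree(), key=lambda pair: pair[1], reverse=True)[:10]
print(top)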

A stronger way to automate both the capture of data from Twitter and the visualisation of the network is to use Gephi with the Twitter API.

Location-captured tweets to graph

If you’re reading this and feeling a bit lost, something you can try right now is making a network-friendly file out of geotagged tweets.

For this example, I thought it would be fun to try out tweets geotagged to Canary Wharf in London. There’s lots of interesting people, places and things there.

We can get the coordinates by doing the following in Google Maps.

Then we can use Twint, with the following command:

twint -g="51.505312, -0.022900,1km" -o canarytweets.csv --csv

This will give you a dataset of tweets tagged in a specific location, which you can either sort manually by column or run through Table2Net and Gephi. Visualising the data is covered a little further on.
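
The same geocoded search can be run through Twint’s Python module; a minimal sketch with the Canary Wharf coordinates:

import twint

# Equivalent of: twint -g="51.505312, -0.022900,1km" -o canarytweets.csv --csv
c = twint.Config()
c.Geo = "51.505312,-0.022900,1km"  # latitude,longitude,radius
c.Output = "canarytweets.csv"
c.Store_csv = True
twint.run.Search(c)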

Capturing data using Twitter API

In order to use Gephi with the plugin TwitterStreamingImporter you must have developer access to the Twitter platform and generate an API key.

You can find the application for a Twitter developer account here.

The TwitterStreamingImporter was developed by Matthieu Totet. To install it, simply access it through the Gephi plugin panel, seen below.

Should you need assistance in setting up an API key or Twitter account, the team behind the TwitterStreamingImporter have a complete step-by-step guide here.

Once installed, enter your API keys and you can start streaming from Twitter’s API whenever you need.

The function I use the most is to pull data based on ‘words to follow’, much like what we did through the command line before. This is just an alternative, and easier, way.

The network logic I choose is generally user network.

The user network is based on the interaction between users, so any mention, retweet or quote will be captured and represented automatically in Gephi.
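
To make that concrete, here is a rough sketch of the logic (the field names are hypothetical, not the plugin’s actual code): each incoming tweet becomes one directed edge per interaction.

# Hypothetical tweet record; illustrates the user-network edge logic only.
def edges_from_tweet(tweet):
    author = tweet["username"]
    targets = set(tweet.get("mentions", []))   # @-mentions
    if tweet.get("retweeted_from"):            # retweets
        targets.add(tweet["retweeted_from"])
    if tweet.get("quoted_from"):               # quote tweets
        targets.add(tweet["quoted_from"])
    # One directed edge per interaction: author -> target
    return [(author, target) for target in targets]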

As for the other options: hashtag network creates a network of tags, emoji network does the same for emojis, and Bernardamus builds user networks based on the tags in tweets.

The full option is also very useful for individual accounts: it builds a network from all Twitter activity, so tweets, tags, URLs and images.

When you first click connect, it will start pulling activity as it happens, through the API.

Depending on the word you choose, the stream may either take a long time to fill as each tweet containing that word is made, or crash your computer within minutes with overwhelming data.

For example, I chose the term “covid”. You can see how it pulled the data in below.

How to cook data and present it nicely in Gephi

The past two sections focussed on where to find possible inauthentic networks, the data you need to create a small network, and how you can scrape data from Twitter.

This section will now focus on the processing of that data in a visualisation platform so that you can visually analyse the data.

In this section I have relied primarily on the use of open source platform Gephi. While there are other visualisation methods out there, this is one that I find reliable and flexible.

An alternative visualisation platform I can recommend is Graphistry. A case study using it is French security researcher fs0c131y’s analysis of the #GiletsJaunes tag on Twitter, seen below.

For the purpose of the next two sections, I am going to use the data I scraped on Evo Morales mentions using Twint. That file was then processed with Table 2 Net to turn the CSV data into a network-friendly file.

Subsequently, I will also use the data I pulled through Twitter’s API to show the visualisation and analysis.

How I display network visualisations

Starting with a clean, unprocessed block of data, there are a lot of things we can do in Gephi.

First, I always like to break up the data using one of the layout functions.

Options such as ‘dissuade hubs’ and ‘prevent overlap’ allow for a more constructive analysis and visualisation.

This is your basic representation of the data with quick processing, but it can look much nicer than that.

How I organise nodes (accounts) into clusters

Once you have displayed your network, it’s time to classify clusters with the ‘modularity’ algorithm. The purpose of this is that you can then automatically apply individual colouring to each cluster.

Running it will issue you with a modularity report, which often has quite interesting details for analysis.

What we’re able to do now is use the partition colouring panel to partition the nodes based on modularity class and apply a palette to them.

Depending on the palette you choose you might want to change the background colour.

Doing so gives you the following result for your data.

Zooming in a little further shows, for effect, what we have done.

There are many ways to display Gephi data for analysis. This is just the one method I have outlined.

Analysis in Gephi’s Data Laboratory

The benefit of using the TwitterStreamingImporter plugin in Gephi is that while data is being pulled into your Gephi visualisation through the API in real-time, you can also conduct an ongoing analysis of accounts and trends in the network as it happens.

For example, let’s take a look at the data we pulled using the API based on the hashtag #bolivianohaygolpe.

This is what some of the network looks like visually.

Over in the data laboratory this is how my edges table looks.

As I mentioned in the earlier sections, one of the important things to look for in identifying possible networks is the account creation date. The data laboratory, much like a spreadsheet, allows you to sort by this for visual pattern identification.

As seen above, we can tell that on November 11 and November 12, 2019, many of the accounts in the #bolivianohaygolpe network were created.

To show them clearly, we can bulk edit the node size.

This gives a direction as to what accounts need further investigating and allows us to focus our analysis time.

In some cases, the data laboratory may also reveal signs of automation in posting times, which can occur as patterns.

Looking at those accounts’ posting times, evident both in the data gathered with Twint and in the API data requested through Gephi, patterns can emerge, as seen in the case study tweet I posted below.
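
One quick way to surface such a pattern from a Twint export is to bucket tweets by the hour they were posted. A minimal sketch with pandas, assuming the default date and time columns of a Twint CSV:

import pandas as pd

df = pd.read_csv("bolivianohaygolpe.csv")
# Twint writes separate "date" and "time" columns; combine and bucket by hour.
df["hour"] = pd.to_datetime(df["date"] + " " + df["time"]).dt.hour
# Organic activity spreads across the day; automation often posts
# in a few tight windows.
print(df["hour"].value_counts().sort_index())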

This, however, is only a starting point with the data. It marks the difference between being able to collect and present data, and pivoting on the results found to dive further down the analytical rabbit hole.

Running a brief analysis of accounts

While I did say the investigative task from here requires manual diving into clusters and accounts, there are some ways to automate the flagging of suspicious accounts and networks.

Three tools I use for this purpose are:

  1. Botometer
  2. TwitterAudit
  3. Image reverse search (I use the RevEye plugin)

I’ll show you how I use them.

Botometer

One free tool I have used in a number of Twitter-based investigations is the Botometer. It’s a joint project of the Network Science Institute (IUNI) and the Center for Complex Networks and Systems Research (CNetS) at Indiana University.

Using Botometer is very simple. It can work in two ways: either through the user interface on the website, or programmatically through its API.
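
For the API route there is a Python client, botometer, published by the same team. A minimal sketch (you supply your own RapidAPI and Twitter app credentials, the handles below are placeholders, and the exact result keys depend on the API version):

import botometer

twitter_app_auth = {
    "consumer_key": "...",
    "consumer_secret": "...",
}
bom = botometer.Botometer(
    wait_on_ratelimit=True,
    rapidapi_key="...",
    **twitter_app_auth,
)

# Score a single account
result = bom.check_account("@example_handle")
print(result["display_scores"])

# Or work through a batch of suspect handles in one pass
for handle, result in bom.check_accounts_in(["@handle_one", "@handle_two"]):
    print(handle, result["display_scores"])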

So let’s run Botometer on some of the accounts we identified as created on November 11, 2019. These are accounts that targeted the #bolivianohaygolpe tag.

Here is a sample of some of the results I got by running those accounts through Botometer.

Note that it will also conduct sentiment analysis, content evaluation and assessments of other factors. This is quite useful, as it saves a researcher the time of conducting this analysis by hand.

Something we can also do is run the same analysis on the followers or friends of those accounts; Botometer will automatically go through each one and conduct the same analysis.

TwitterAudit

Another tool that makes this quite simple is TwitterAudit. For me, this tool was not relevant to this investigation, but it can offer insights into Twitter accounts with much larger follower counts.

For example, this link is a search result of a Twitter audit of the account of Evo Morales.

It is quite common for accounts in the public eye to have followers flagged as ‘fake’. The indicia for flagging can include account age and whether the account posts, so an account made only for reading Twitter rather than participating could be flagged.

This is why there should always be a human check in an investigation, rather than relying on tools for a conclusive result.

Image Reverse Search

A human check involves simple logic, such as checking the origin of a profile picture using a reverse image search, as seen below.

To do this, I use the RevEye plugin, which enables a quick reverse search across a choice of platforms, as seen below. Evo Morales’ profile was chosen for example purposes only.

Concluding remarks

The content I have covered in this report provides introductory knowledge of the who, where and what of information operations on Twitter; two alternative methods of scraping data, using Python or Twitter’s API; and ways of presenting that data in a visualisation platform.

Please do note that there may be alternative ways of looking at this data and different tools; however, I have kept this report limited to the tools I have used in successful information-operation detection cases.

While these are helpful tools for investigators, researchers and journalists, they are only a beginning to investigating an information operation. The leads surfaced in the data must be followed up with qualitative analysis to make original findings.

I hope this information has been helpful to those of you conducting your own research, either on Twitter or other platforms.
