Data Collection and Analysis Tools from the SMaPP Lab
As part of our project to construct comprehensive data sets and to empirically test hypotheses related to social and political behavior and attitudes, we have developed a suite of tools and modeling processes now available for broader dissemination. Below, we list the tools, describe their functionality and outline how interested parties can best use them.
pysmap is a package for manipulating and analyzing data sets containing large numbers of tweets. The package uses Python 3, a high-level programming language, but it is user-friendly and intuitive. It is designed to be used by someone with minimal programming experience; only a very basic working knowledge of Python is required. You can visit the documentation on GitHub to learn how to install and use pysmap. pysmap can read many different data sources, including BSON, JSON, and CSV files, and it can use MongoDB as a back-end data source. It supports basic analyses such as counting tweet features (hashtags, mentions, etc.) and making basic graphs of tweet counts, and it can filter tweets with a set of premade filters, such as language (detected or Twitter-ascribed) and geo-enabled or non-geo-enabled tweets. If you would like a new filter to be considered, feel free to submit a GitHub issue with your feature request.
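To give a sense of what "counting tweet features" involves, here is a minimal sketch that uses only the Python standard library. It does not use pysmap's own API (see the GitHub documentation for that); the function name count_tweet_features and the sample data are hypothetical. It assumes one JSON-encoded tweet per line, with hashtags and mentions stored under the "entities" field as in the standard Twitter tweet object.

```python
import json
from collections import Counter


def count_tweet_features(lines):
    """Count hashtags and mentions across an iterable of JSON-encoded tweets.

    Hypothetical helper for illustration; this is not part of pysmap.
    """
    hashtags, mentions = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        entities = tweet.get("entities", {})
        for tag in entities.get("hashtags", []):
            hashtags[tag["text"].lower()] += 1
        for user in entities.get("user_mentions", []):
            mentions[user["screen_name"].lower()] += 1
    return hashtags, mentions


# Made-up sample data standing in for a file with one tweet per line.
sample_tweets = [
    {"text": "Go #Vote today",
     "entities": {"hashtags": [{"text": "Vote"}], "user_mentions": []}},
    {"text": "@example_user did you #vote? #Election",
     "entities": {"hashtags": [{"text": "vote"}, {"text": "Election"}],
                  "user_mentions": [{"screen_name": "example_user"}]}},
]
sample_lines = [json.dumps(tweet) for tweet in sample_tweets]

hashtag_counts, mention_counts = count_tweet_features(sample_lines)
print(hashtag_counts.most_common(5))
print(mention_counts.most_common(5))
```

With a real file, the same function could be called on an open file handle, since iterating over a file yields one line at a time.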
smappdragon is a low-level data parser designed for parsing Twitter data. It is meant to be used by those with more experience programming in Python (or programming in general). You can read the documentation on GitHub. smappdragon can be used to write your own methods and filters for complex Twitter data, to reduce the size of a data set (by stripping unnecessary fields out of entire data sets), and to check individual fields in tweet objects easily. We appreciate all feedback: if you find a bug or have any suggestions for improvements, please submit a GitHub issue.
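As an illustration of the field-stripping idea, the sketch below keeps only a chosen set of fields from each tweet object. It is plain Python and does not reproduce smappdragon's API; the names get_path and strip_fields, the dotted-path convention, and the sample tweet are assumptions made for this example.

```python
_MISSING = object()  # sentinel so that stored None values are still kept


def get_path(obj, path):
    """Look up a dotted path such as 'user.screen_name' in nested dicts."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return _MISSING
        obj = obj[key]
    return obj


def strip_fields(tweet, keep):
    """Return a new dict holding only the dotted paths listed in keep.

    Hypothetical helper for illustration; this is not part of smappdragon.
    """
    slim = {}
    for path in keep:
        value = get_path(tweet, path)
        if value is _MISSING:
            continue  # field absent from this tweet, nothing to copy
        *parents, leaf = path.split(".")
        target = slim
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return slim


# Made-up tweet object with more fields than the analysis needs.
tweet = {
    "id_str": "1",
    "text": "hello",
    "source": "web",
    "user": {"screen_name": "alice", "followers_count": 3, "lang": "en"},
}

slim_tweet = strip_fields(tweet, ["id_str", "user.screen_name", "coordinates.type"])
print(slim_tweet)
```

Applied to every tweet in a large collection, a reduction like this can shrink the data considerably before further analysis.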
The Toolkit serves as the foundational package for manipulating and analyzing data sets containing large numbers of tweets. The package uses Python, a high-level programming language, but it is extremely user-friendly and intuitive. We discuss how to download Python and set up your machine to take advantage of the Toolkit below, but lack of familiarity with the language will not preclude you from finding value in the tools.
The Toolkit uses MongoDB, an open-source document database, as its back-end data source. Once you have collected your data in this form, the Toolkit provides an abundance of ways to manipulate and analyze that data.
The Toolkit allows you to manipulate your collection of tweets in a variety of useful ways. You can always sort the data by the available metadata, which includes the tweeter’s name, screen name, preferred language, friends and followers, geographic location (where available) and number of tweets. You can segment the data according to time of tweet as well, allowing the interested researcher to see effects in “real time.” The Toolkit also allows the user to dump their manipulated data into .csv files, with columns delineated as chosen by the user.
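The manipulation steps described above (segmenting by time, sorting by metadata, and dumping to .csv with user-chosen columns) can be sketched with the Python standard library alone. This is not the Toolkit's API; the helper names, the dotted-column convention, and the sample tweets are hypothetical. It assumes tweets are dictionaries whose "created_at" field uses Twitter's usual timestamp layout, for example "Tue Nov 08 14:00:00 +0000 2016".

```python
import csv
import io
from datetime import datetime, timezone

TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def get_nested(tweet, column):
    """Resolve a dotted column name such as 'user.screen_name'."""
    value = tweet
    for key in column.split("."):
        value = value.get(key) if isinstance(value, dict) else None
    return value


def tweets_in_window(tweets, start, end):
    """Keep tweets whose creation time falls in [start, end)."""
    kept = []
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
        if start <= created < end:
            kept.append(tweet)
    return kept


def write_csv(tweets, handle, columns):
    """Write one row per tweet, with the columns chosen by the caller."""
    writer = csv.writer(handle)
    writer.writerow(columns)
    for tweet in tweets:
        writer.writerow([get_nested(tweet, column) for column in columns])


# Made-up sample tweets.
tweets = [
    {"created_at": "Tue Nov 08 14:00:00 +0000 2016", "text": "Polls are open",
     "user": {"screen_name": "alice", "followers_count": 120}},
    {"created_at": "Tue Nov 08 23:30:00 +0000 2016", "text": "Results coming in",
     "user": {"screen_name": "bob", "followers_count": 950}},
    {"created_at": "Wed Nov 09 09:00:00 +0000 2016", "text": "Morning after",
     "user": {"screen_name": "carol", "followers_count": 40}},
]

# Segment: keep one day of tweets (UTC).
day = tweets_in_window(
    tweets,
    datetime(2016, 11, 8, tzinfo=timezone.utc),
    datetime(2016, 11, 9, tzinfo=timezone.utc),
)
# Sort: most-followed accounts first.
day.sort(key=lambda t: t["user"]["followers_count"], reverse=True)

# Dump: an in-memory buffer stands in for a real .csv file here.
buffer = io.StringIO()
write_csv(day, buffer, ["user.screen_name", "user.followers_count", "text"])
print(buffer.getvalue())
```

To write to disk instead, the buffer could be replaced with a file opened with newline="" as the csv module documentation recommends.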
The Toolkit allows the user to complete analysis at varying levels of complexity. At the simplest level, users can search their collection of tweets and count the occurrences of a particular word or set of words. They can characterize the tweets by language used or by location of tweeter (extremely useful for cross-national or cross-ethnic investigations). The Toolkit also provides graphing and figure-making functionality that allows the user to display data and insights in a visually appealing and informative manner. Some examples of visualizations can be found here.
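The simplest analysis mentioned above, counting occurrences of a word or set of words and tallying tweets by language, might look like the following sketch. It is written in plain Python rather than with the Toolkit's own functions; count_terms, count_languages, and the sample tweets are hypothetical, and the "lang" field is assumed to hold the language code attached to each tweet.

```python
import re
from collections import Counter


def count_terms(texts, terms):
    """Count case-insensitive whole-word occurrences of each term."""
    wanted = {term.lower() for term in terms}
    counts = Counter()
    for text in texts:
        # \w+ drops punctuation, so "#Vote" is counted as the word "vote".
        for token in re.findall(r"\w+", text.lower()):
            if token in wanted:
                counts[token] += 1
    return counts


def count_languages(tweets):
    """Tally tweets by their 'lang' field ('und' when it is missing)."""
    return Counter(tweet.get("lang", "und") for tweet in tweets)


# Made-up sample tweets.
tweets = [
    {"text": "Go vote today! #Vote", "lang": "en"},
    {"text": "I voted already", "lang": "en"},
    {"text": "Vote early, vote often", "lang": "en"},
    {"text": "Hoy es el día", "lang": "es"},
]

term_counts = count_terms((t["text"] for t in tweets), ["vote", "early"])
language_counts = count_languages(tweets)
print(term_counts)
print(language_counts)
```

Note that whole-word matching treats "voted" as a different word from "vote"; whether that is desirable depends on the research question.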
For more detailed documentation, including installation instructions, see the SMaPP Toolkit page on GitHub here.
smappPy is a support package meant for high-level programming use. However, it offers a host of useful functions for those attempting to take advantage of social media data. It can help the researcher from start to finish, beginning with the collection of the data itself. smappPy contains tools that allow for greater ease in token pulling and any general interaction with the Twitter API.
Once data has been gathered, smappPy offers text cleaning modules and utilities that allow you to store your tweets in the appropriate Mongo database format. Additional manipulation functionality includes allowing users to pull images, hashtags and other interesting features from their collection.
smappPy offers users several advanced analysis options, including LDA topic modeling and tools that allow you to reconstruct Twitter networks from your collected data.
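As a rough idea of what reconstructing a network from collected tweets can mean, the sketch below builds a weighted, directed edge list in which an edge runs from a tweet's author to each account the tweet mentions. This is only one possible network definition and is not smappPy's implementation; the function mention_edges and the sample data are hypothetical, and the field names follow the standard Twitter tweet object.

```python
from collections import Counter


def mention_edges(tweets):
    """Return a Counter mapping (author, mentioned_user) to mention counts.

    Hypothetical helper for illustration; this is not part of smappPy.
    """
    edges = Counter()
    for tweet in tweets:
        author = tweet.get("user", {}).get("screen_name")
        if not author:
            continue  # cannot place a tweet without a known author
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            target = mention.get("screen_name")
            if target and target != author:  # ignore self-mentions
                edges[(author, target)] += 1
    return edges


# Made-up sample tweets.
tweets = [
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"}]}},
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"},
                                    {"screen_name": "carol"}]}},
    {"user": {"screen_name": "bob"},
     "entities": {"user_mentions": [{"screen_name": "bob"}]}},
]

edges = mention_edges(tweets)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

An edge list in this form can be loaded into most network analysis packages for further study.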
You can go to the smappPy page on GitHub here for additional information, including installation instructions.
We are constantly adding to the capabilities of the Toolkit and smappPy, so users should ensure that they have the most recent updates to access full functionality.
All packages outlined here are Python-based and require that the user’s computer meet certain basic requirements. The lab has established a protocol for ensuring that your computer is properly set up before attempting to use the various tools. If you have additional questions, please feel free to email the lab.