Monday, November 15, 2010

Detecting Spam in a Twitter Network - Bridget Gelms

Technology is changing the way people communicate. Whether it’s by instant messenger, blogging, video chatting, or e-mail, people need to adapt to the ever-progressing forms of communication that the digital age provides. One such form is Twitter. Twitter is a website where people can send out a version of a text message from virtually anywhere. Because it works with many different devices, the site makes it simple for people to send out a message. Unfortunately, what is easy for legitimate users is also easy for illegitimate users like spammers.
            Spam and spammer are words that were born out of the creation of the internet. Spam, as defined by Google, is the “abuse of electronic messaging systems (including most broadcast media, digital delivery systems) to send unsolicited bulk messages indiscriminately” while a spammer is said to be “a person or organization that sends spam.” Twitter and its structure allow spam to evolve with the times, meaning that the junk messages that used to be filtered out of our e-mail accounts can now reach us through different modes of communication. Spammers are constantly coming up with new and innovative ways to trick spam detectors and appear to be normal, everyday internet users. This, in turn, creates a need for researchers to come up with new and innovative ways to detect spam messages. Previously, spam was detected based on the content of the actual message. Now, spammers are beginning to be identified by how their spam is sent, which is harder to detect but is an element of spam that spammers can’t hide. This piqued the interest of researchers Sarita Yardi, Daniel Romero, Grant Schoenebeck, and Danah Boyd, who conducted a study in an attempt to learn about spam on Twitter. 
The structure and functions of Twitter have changed the spamming game, and the article identifies three main reasons. First, Twitter allows you to “follow” someone even though they aren’t “following” you. Therefore, spammers can be connected to large numbers of people without having to go through the work of getting others to follow them. This sets Twitter apart from other social networking sites like Facebook and Myspace, where both parties must consent to being “friends” before one can see the other’s activity. Second, the ability to “re-tweet” (or “tweet” something that someone else has “tweeted”) gives spammers the opportunity to alter links that originally pointed to legitimate sites so that they now direct someone to the spam. URL shorteners are used so that links don’t take up the majority of the tweet (Twitter only allows its users 140 characters per tweet). These shorteners don’t reveal the full name of the website, so those who click on the link have to trust that it’s legitimate. This gives spammers a new way to conceal their spam links. Third, the user-generated aspect of Twitter means that tweeting patterns change in regularity, amount, and circulation depending on its users. This allows spammers to use what’s being said on Twitter to their advantage.
            Another feature of Twitter is “hashtags”. Placing a hash mark in front of a word or phrase that describes what your tweet is about creates a link to all the other tweets that are marked with that same hashtag. For example, someone might tweet “What is the deal with that smoke monster?!” and tag it “#LOST” in order to participate in a discussion of the popular TV show Lost. Twitter tracks what the most popular hashtags are and posts the top 10 list on the main page, making it extremely easy for thousands of people to engage in a conversation about the same thing, even if they aren’t “following” each other.
            The article reports findings from various studies of what people use Twitter for. While reasons vary, it’s been found that most people use Twitter for its “common ground” aspect. Even though they might only be following a few people that they actually know, they are able to connect to many people that they don’t know based on some sort of common ground, made possible by the hashtags.
            Specifically, the article tracks the growth of a hashtag that is generated solely within Twitter, meaning its birth, life, and death occur only on Twitter. The researchers give #michaeljackson and #iranianelection as examples of hashtags that do not suit this study because outside influences could affect their progression. Instead they use #robotpickuplines, a hashtag started by the user grantimahara, who has many followers as a host of the popular television show MythBusters (and who is also a robot builder, hence the joke about robot pick-up lines). They track #robotpickuplines from the very first tweet, through everyone who participated in the conversation, all the way to the hashtag’s extinction.
            Within two hours of grantimahara tweeting #robotpickuplines, it had become a “trending topic” and appeared on Twitter’s homepage, where it remained for at least three more hours. That exposure gave the hashtag more viewership, which led to new tweets with the tag #robotpickuplines and even re-tweets of existing ones. The study found that the most commonly re-tweeted robot pick-up line was “Hi my name’s Vista. Can I crash at your place tonight?” The researchers tracked #robotpickuplines over its lifespan of four days and found 17,803 tweets involving this tag, generated by 8,616 different users. They also found that “user participation followed a power law distribution where 6,021 users tweeted one time, 2,595 tweeted two or more times, and a dedicated 205 tweeted 10 or more times using the #robotpickuplines.”
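To make those participation figures concrete, here is a rough sketch of how such a tally could be computed from a list of tweet authors. It is only an illustration: the function name `participation_buckets` and the sample data are made up for this post and do not come from the study.

```python
from collections import Counter


def participation_buckets(authors):
    """Group users by how many times they tweeted.

    `authors` holds one username per tweet, so a user who tweeted
    three times appears three times.
    """
    per_user = Counter(authors)  # username -> number of tweets
    counts = per_user.values()
    return {
        "users": len(per_user),
        "once": sum(1 for c in counts if c == 1),
        "two_or_more": sum(1 for c in counts if c >= 2),
        "ten_or_more": sum(1 for c in counts if c >= 10),
    }


# Made-up sample data: one author name per tweet.
sample = ["ann"] * 12 + ["bob"] * 3 + ["cat", "dan", "eve"]
buckets = participation_buckets(sample)
print(buckets)
```

Note that in the study’s numbers, 6,021 + 2,595 = 8,616, so the “one time” and “two or more times” groups together account for every user, and the 205 heavy tweeters are a subset of the 2,595.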
So where does information about spammers fit into this process? As noted above, spammers use certain tactics to infiltrate legitimate tweets: including links to a URL, shortening the URLs so that the site name is not visible (which puts the responsibility on host sites such as www.bitly.com to sift out the spam), using more than one hashtag to cover a larger base, and sometimes using “suggestive keywords” that never fail to get some hits, such as “naked”, “girls”, or “webcam”. With these tactics in mind, the researchers followed a simple set of rules to detect spammers within the #robotpickuplines trend. They manually went through a random sampling of 300 tweets with the hashtag #robotpickuplines and marked each one as spam or not spam. They then used this hand-labeled set as a point of comparison for their algorithm’s labels. The study reports that “our algorithm matched 91 percent of the time, with 27 missed spam tweets and 12 false positives.”
Using this information, they were able to discover a correlation between the lifecycle of #robotpickuplines and the spread of spam that used the hashtag. The number of spam messages containing the tracked hashtag rose in step with legitimate tweets containing #robotpickuplines, with a slight lag. Even though spammers took about five hours to begin latching onto this hashtag after grantimahara’s first posting, 14% of all tweets containing the tag #robotpickuplines were generated by spammers. Here are two bar graphs from the study illustrating their findings; the first shows the first 24 hours of #robotpickuplines. The study reports that the hashtag “started at 11am CST and spiked around 3pm when it became a trending topic. It dropped around 4am and picked up again, although less heavily, the next day.” The next graph is the spam tweets using #robotpickuplines.
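The tactics listed above lend themselves to a simple rule-based check. What follows is a minimal sketch of that idea and not the researchers’ actual algorithm, whose exact rules are not spelled out in this post. The function names, the regular expressions, the “at least two signals” threshold, and the second sample tweet (including its example.com link) are all my own assumptions for illustration; only the three keywords and the first sample tweet come from the text above.

```python
import re

# The three "suggestive keywords" quoted in the post.
SUGGESTIVE_KEYWORDS = {"naked", "girls", "webcam"}

# Assumed patterns: a loose match for links, hashtags, and plain words.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
HASHTAG_PATTERN = re.compile(r"#\w+")
WORD_PATTERN = re.compile(r"[a-z]+")


def spam_signals(tweet):
    """Report which of the three tactics show up in a tweet."""
    text = tweet.lower()
    words = set(WORD_PATTERN.findall(text))
    return {
        "has_url": bool(URL_PATTERN.search(text)),
        "multiple_hashtags": len(set(HASHTAG_PATTERN.findall(text))) > 1,
        "suggestive_keyword": bool(words & SUGGESTIVE_KEYWORDS),
    }


def looks_like_spam(tweet, min_signals=2):
    """Flag a tweet when enough signals fire (the threshold is an assumption)."""
    return sum(spam_signals(tweet).values()) >= min_signals


joke = "Hi my name's Vista. Can I crash at your place tonight? #robotpickuplines"
# Invented spam-style tweet for demonstration only.
spam = "Hot girls on webcam http://example.com/abc #robotpickuplines #iranianelection"

print(looks_like_spam(joke))
print(looks_like_spam(spam))
```

A check like this only looks at the content of a single tweet. As described earlier, the study’s larger point is that how spam is sent can be a stronger giveaway than what it says.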


The study also addresses a series of questions that may or may not affect the results. For instance, the researchers ask whether spam accounts and legitimate accounts differ in age. They also examine whether spammers tweet more often than regular users of Twitter and whether spammers have more friends than followers linked to their accounts. Finally, they examine whether or not spammers are grouped together. While their research is thorough and their findings on these questions are interesting, I think that what we can relate to our work in our Digital Literacies class comes from the information already presented. This study is relevant to our work in English 213 for a number of reasons, but I want to focus primarily on themes found in Clay Shirky’s Here Comes Everybody, specifically chapters five and seven, because they help explain why events within the study progressed the way that they did and perhaps why Twitter is such an easy target for spammers.
            In chapter 5 of Here Comes Everybody, Shirky includes a figure (found on page 129) that illustrates the principles of Twitter very well by showing the relationship between an audience and the pattern of conversation. The larger the audience, the looser the conversation will be. We see this in the study of the hashtag #robotpickuplines on Twitter in that not everyone who participated in that conversation heard everything else that was said within it. Had the researchers tracked the life of a similar hashtag within a medium that has a much smaller audience than Twitter, the conversation would have been much more contained and would not have been a target for spammers. 
            So how has Twitter changed the way spam works? For e-mail, spammers had to look for who to target and how to do so. In Twitter, they now look for what to target and when to do so. For example, the “Trending Topics” table on the main page of Twitter lets users (and spammers) know which items are most popular. As topics begin “trending”, they build momentum. This can be a definite indicator for spammers, as it takes care of the “who” and “how” that they used to have to figure out with e-mail.
            Chapter seven in Here Comes Everybody also addresses issues related to Twitter and this study. Shirky stresses how the speed of various processes affects our everyday lives, and he begins his chapter titled “Faster and Faster” by saying that “collective action is different from individual action”. This is seen in the study of the evolution of #robotpickuplines on Twitter because it took a massive number of people linking up to this hashtag before spammers took an active interest in utilizing it for their benefit. Shirky directly discusses Twitter in chapter seven and describes it as being “simple” yet “compelling”. Shirky explains that websites like Twitter are primarily used for the benefit of small groups, like people who read their friends’ tweets. However, he also describes instances where Twitter can be far more important in its uses, such as a democracy activist in Egypt who recorded his arrest through his Twitter account. It’s through this “simple” process that human beings can create something “compelling”.
            Much of Shirky’s book homes in on the idea that people now possess the tools necessary to organize efficiently and create a revolution by using these tools to their advantage. Twitter can be one such tool. While tweets involving #robotpickuplines might not seem that important, this study does demonstrate the power that Twitter can hold. A nerdy joke about robots was sent out into the digital realm in less than 140 characters. This single action, which probably took its creator only a few seconds, sparked a form of “organization” that resulted in connecting thousands of people to the same joke. This group of people commanded enough attention that spammers took interest. It was because of one simple message that thousands of people, who were otherwise not connected to each other in any way, were able to take part in a conversation instantly. The researchers created a visual representation of that conversation:

With each pink dot representing a Twitter account, this picture makes a statement about our abilities to organize and adopt new behaviors as people who are living in a digital world. This picture astonishingly shows just how far something like #robotpickuplines can extend within the span of a few hours.
Works Cited

Boyd, Danah, Daniel Romero, Grant Schoenebeck, and Sarita Yardi. “Detecting spam in a Twitter network.” First Monday Volume 15, Number 1 (4 January 2010): web. http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2793/2431

Shirky, Clay. Here Comes Everybody. New York, New York: Penguin Books, 2008. Print.

“define: spam”, “define: spammer”. Google (October 16, 2010): web. http://www.google.com
