Blog Mining: What Links Bloggers Together? Aaron Davis aaron [email protected]
April 3, 2007
Abstract
Recently there has been an increase in the user-generated content available from the World Wide Web in the form of web logs, or blogs. Blogs are easily and commonly connected through syndication methods such as RSS. These connections imply a network of blogs that may be directly related to affinity networks.
A relatively new form of human interaction via the World Wide Web (web) is a practice called blogging. The term blog is a colloquialism for a web log. A web log is generally characterized by short, dated entries written by a person or group of people and accessible from the web. One who engages in blogging is referred to as a blogger. The evolution of blogging has brought with it tools for syndication. Bloggers frequently hope that other bloggers will syndicate something they wrote in order to get more exposure. While various methods for doing this exist today, one of the most common formats for syndication is called RSS 1 2 . RSS is an XML format that summarizes the entries in a blog in a time horizon fashion.
A process of crawling links between blogs to discover networks is a common method of initializing an affinity network. Indeed, there is a fair amount of social capital 3 in simply knowing who is linked to whom. One drawback to this method is that it may not capture implicit links between bloggers who do not syndicate content from one another but share many common blogging topics.
Another process for capturing an affinity network from blog data would be to run a clustering algorithm on the blog content. After creating clusters, links between bloggers can be inferred from blog entries that are part of the same cluster. This method overlays a structure that relies not on direct linking but on indirect linking based on inferred topics. Inferred linkage may not be desirable for some applications. Additionally, the only intuitive method for reviewing the structure would be to read full entries in the clusters.
Yet another process would be to use a topic modeling framework such as Latent Dirichlet Allocation. By running such an algorithm we can infer similar topics for bloggers, much as with clustering. However, because topics are a mixture component generated from tokens in the document, a list of topic components (tokens) is produced.
These topic components represent a reduced dimensionality and should represent highly predictive tokens for linking bloggers. Like the clustering technique, this technique infers topics, which may be undesirable for some applications.
This paper will look at each of these methods to see what kinds of affinity networks are created. A description of the data set used for experiments is included. Details for each of the methods will be given in the Experiments section. Results include a representation of the network for each method. Analysis and Conclusion will be brief to allow readers to determine appropriate application in their domain. A section on potential expansion and improvements to these processes, followed by Acknowledgments, will complete the paper.
1 http://www.rssboard.org/rss-specification 2 http://en.wikipedia.org/wiki/RSS (file format) Capital
Blog Data Set
One data set will be used for all experiments. This data set represents a collection of blog entries as described in section 2.1.
The blog data set used is updated nightly. Each blog was selected for inclusion in a list that is checked each night. The list came from Robert Scoble's Bloglines 4 . The oldest blog entry in the data set is dated March 10, 2003. The static set of blog entries for all experiments in this paper ranges from March 10, 2003 to March 16, 2007. Total entries during this time period equal 27,913. After joining to the bloggers table, a total of 27,886 rows were returned. Each entry from the database includes a title, body, and published date. The feature set for this paper comes only from the non-markup tokens in the title and body. A blog entry is represented as a document vector or "bag of words." All tokens are separated by space or punctuation. All tokens are set to a lowercase representation of the character sequence.
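The tokenization just described can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: the function name `bag_of_words` and the regular expression are assumptions, and the sample text is invented. Splitting on punctuation turns a contraction such as "don't" into "don" and "t", which is consistent with fragments like "don" and "ll" appearing among the unigram topic components reported later.

```python
import re
from collections import Counter

# Assumption: a token is a maximal run of letters or digits; spaces and
# punctuation (including apostrophes and underscores) act as separators.
TOKEN_PATTERN = re.compile(r"[^\W_]+")


def bag_of_words(title, body):
    """Return lowercase token counts for one markup-free blog entry."""
    text = f"{title} {body}".lower()
    return Counter(TOKEN_PATTERN.findall(text))


# Invented sample entry, for illustration only.
counts = bag_of_words("Blog Mining", "What links bloggers? Don't ask; blogs link blogs.")
```

Here `counts` is a sparse document vector: only tokens that occur in the entry are stored.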
The provided data set was extracted from blog entries via the Hypertext Transfer Protocol (HTTP). This data was then parsed and included in a series of tables in a relational database. In order to use this data for two of the experimental methods, I needed to extract the data from the database and write each entry as a single document. Markup was removed and the remaining tokens were written directly to the file. Files were named with an indexing scheme starting at 10000 and incrementing with each entry. An index file was created pointing to the files on the file system. For debugging purposes, a series of index files was created with a reduced set of documents. All final experiments were run on the full data set described in section 2.1. For the Latent Dirichlet Allocation method, the files were passed on the command line rather than read from an index file. Standard shell pattern characters were used to pass the full data set to the document vectorizing function.
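A sketch of the file layout described above is given here. The `.txt` extension, the `index.txt` file name, and the one-path-per-line index format are assumptions made for illustration; the text only states that file names start at 10000 and that an index file points to the files.

```python
from pathlib import Path


def write_corpus(entries, out_dir, start=10000):
    """Write one file per entry and an index file listing the written paths.

    entries: iterable of markup-free entry texts.
    Returns the path of the index file (hypothetical name: index.txt).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for offset, text in enumerate(entries):
        # File names follow the numbering scheme: 10000, 10001, ...
        path = out / f"{start + offset}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    index = out / "index.txt"
    index.write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return index
```

A reduced index for debugging could be produced by calling the same function on a slice of the entries and a different output directory.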
Experiments were performed according to three methods. A brief description of each method is included below. In order to compare results from each experiment, all three methods produce an n × n sparse upper-triangular matrix, where n represents the number of bloggers. The non-zero values of the triangular matrix represent the number of counted links between two bloggers according to the method used for link extraction. The diagonal of this matrix is not used, as self-linkage is not reported in the outputted affinity network.
\[
A_{n,n} =
\begin{pmatrix}
d_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
0 & d_{2,2} & \cdots & a_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & d_{n,n}
\end{pmatrix}
\]
where each $a_{i,j}$, for $i = 1 \ldots n$ and $j = 1 \ldots n$ with $i < j$, is the count of links between bloggers $i$ and $j$.
4 http://www.bloglines.com/
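One possible in-memory representation of the triangular link matrix is sketched here. The class name `LinkMatrix`, its methods, and the use of 0-based blogger indices are assumptions for illustration; the paper does not specify a data structure. Only non-zero cells above the diagonal are stored, and a link between two bloggers is recorded in the same cell regardless of the order in which they are given.

```python
from collections import defaultdict


class LinkMatrix:
    """Sparse upper-triangular matrix of link counts between n bloggers."""

    def __init__(self, n):
        self.n = n
        self.counts = defaultdict(int)  # non-zero cells (i, j) with i < j

    def add_link(self, a, b, weight=1):
        if not (0 <= a < self.n and 0 <= b < self.n):
            raise IndexError("blogger index out of range")
        if a == b:
            return  # the diagonal is unused: self-links are not reported
        i, j = (a, b) if a < b else (b, a)
        self.counts[(i, j)] += weight

    def get(self, a, b):
        if a == b:
            return 0
        i, j = (a, b) if a < b else (b, a)
        return self.counts.get((i, j), 0)


matrix = LinkMatrix(3)
matrix.add_link(2, 0)  # stored in cell (0, 2)
matrix.add_link(0, 2)  # same cell
matrix.add_link(1, 1)  # ignored self-link
```

Because every method fills the same structure, the resulting networks can be compared cell by cell.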
In the Direct Linkage Method (DLM), the body of each blog entry is parsed for hyperlinks to other blog entries in the data set. When a hyperlink to another blogger's entry is found, the corresponding position in the triangular matrix is updated.
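The direct linkage step might look like the following sketch. It assumes that a hyperlink can be attributed to a blogger by its host name through a lookup table (`host_to_blogger`, a hypothetical mapping); how the original experiment matched hyperlinks to entries in the data set is not stated. The host names and entry bodies are invented.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collect the href value of every anchor tag in an entry body."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def direct_links(entries, host_to_blogger):
    """entries: iterable of (blogger_id, html_body).

    Returns a Counter keyed by (i, j) with i < j, i.e. the upper triangle.
    """
    counts = Counter()
    for author, body in entries:
        parser = HrefCollector()
        parser.feed(body)
        for href in parser.hrefs:
            target = host_to_blogger.get(urlparse(href).netloc.lower())
            # Skip links that leave the data set and links to oneself.
            if target is not None and target != author:
                counts[tuple(sorted((author, target)))] += 1
    return counts


# Invented hosts and bodies, for illustration only.
host_to_blogger = {"a.example": 0, "b.example": 1}
entries = [
    (0, '<p><a href="http://b.example/post/1">this</a> '
        '<a href="http://elsewhere.example/">that</a></p>'),
    (1, '<a href="http://a.example/x">reply</a> <a href="/local">self</a>'),
]
links = direct_links(entries, host_to_blogger)
```

In this toy input both bloggers link to each other once, so the single cell (0, 1) receives a count of 2; the external link and the relative link are ignored.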
Expectation-Maximization (EM) is an attempt to learn a hidden node representing a cluster in a mixture of multinomials. This experimental method includes sweeping k to find the maximum log marginal likelihood for a given k on the data set. This k is then used in estimating clusters. Links are then made between bloggers naïvely when each blogger has an entry in a shared cluster with another blogger.
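The naïve linking step can be sketched as below, given cluster assignments that have already been estimated. Counting one link per shared cluster for each pair of bloggers is an assumption; the text does not say whether links are counted per cluster or per pair of entries. The function name and sample assignments are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cluster_links(assignments):
    """assignments: iterable of (blogger_id, cluster_id), one per entry.

    Returns a Counter keyed by (i, j) with i < j, counting the clusters
    in which both bloggers have at least one entry.
    """
    members = defaultdict(set)
    for blogger, cluster in assignments:
        members[cluster].add(blogger)
    counts = Counter()
    for bloggers in members.values():
        for i, j in combinations(sorted(bloggers), 2):
            counts[(i, j)] += 1
    return counts


# Invented assignments: blogger 0 has two entries in cluster "c1".
assignments = [(0, "c1"), (1, "c1"), (0, "c1"), (1, "c2"), (2, "c2"), (0, "c2")]
cluster_link_counts = cluster_links(assignments)
```

Bloggers 0 and 1 share both clusters here, so their cell is 2, while each pair involving blogger 2 shares only cluster "c2".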
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a relatively new technique for topic modelling and dimensionality reduction in natural language processing. For this experiment I used the MALLET open source toolkit. MALLET implements Gibbs sampling to approximate probabilities of topic components given a corpus. The final output includes an approximation of the tokens from the corpus that are most likely to predict the topic. These tokens are then used in a query to the blog entries to infer membership of an entry in a topic. Similar to the previous method (EM), linkage is recorded for bloggers who share entries in the same topic.
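The query step can be illustrated with a small sketch. The matching rule used here (an entry belongs to a topic if it contains at least `min_hits` of the topic's components as whole phrases) is an assumption, since the exact form of the query is not described. The phrases are taken from the n-gram topic components reported in the Results; the entry texts and all names are invented.

```python
import re


def topic_membership(entries, topic_components, min_hits=1):
    """entries: {entry_id: text}; topic_components: {topic_id: [phrases]}.

    Returns {topic_id: set of entry ids that match the topic}.
    """
    patterns = {
        topic: [re.compile(r"\b" + re.escape(p) + r"\b") for p in phrases]
        for topic, phrases in topic_components.items()
    }
    membership = {topic: set() for topic in topic_components}
    for entry_id, text in entries.items():
        # Lowercase and collapse whitespace so multi-word phrases can match.
        normalized = " ".join(text.lower().split())
        for topic, compiled in patterns.items():
            hits = sum(1 for pattern in compiled if pattern.search(normalized))
            if hits >= min_hits:
                membership[topic].add(entry_id)
    return membership


topics = {2: ["windows vista", "open source"], 8: ["sql server", "visual studio"]}
sample_entries = {
    "e1": "Installing Windows  Vista on a virtual machine",
    "e2": "SQL Server tips for Visual Studio users",
    "e3": "Ice cream",
}
membership = topic_membership(sample_entries, topics)
```

Once membership is known, bloggers can be linked by shared topic in the same way they are linked by shared cluster in the EM method.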
In order to accommodate visualization of affinity networks in this paper a cutoff of %fill in with final results after creating suitable graph visualization% link counts was used. This has the effect of only showing a network of nodes with the strongest links.
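Applying such a cutoff amounts to a simple filter over the link counts, sketched below. The cutoff value of 5, the choice of "at least" rather than "greater than", and the sample counts are all placeholders for illustration and are not the values used for the figures.

```python
def strongest_links(link_counts, cutoff):
    """Keep only the pairs whose link count is at least `cutoff`."""
    return {pair: count for pair, count in link_counts.items() if count >= cutoff}


# Invented counts keyed by (i, j) with i < j.
all_links = {(0, 1): 12, (0, 2): 3, (1, 3): 7, (2, 3): 1}
kept = strongest_links(all_links, cutoff=5)

# Nodes that remain in the displayed network.
nodes = sorted({blogger for pair in kept for blogger in pair})
```

Bloggers whose links all fall below the cutoff, such as blogger 2 in this toy input, drop out of the displayed graph entirely.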
Direct Link
Figure 1: Affinity Network Graph: DLM
The results for sweeping k using EM show the optimal k to be 15. The results are displayed in Figure 2.
Figure 2: Sweeping k with EM (vertical axis: Log Marginal Likelihood, x1e+7; series: Iteration 1, Iteration 2, Iteration 3, Average)
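Selecting k from a sweep can be sketched as choosing the value with the highest log marginal likelihood averaged over the repeated runs. The function name is illustrative and the likelihood values below are toy numbers chosen only to make the example runnable; they are not the measurements from the sweep.

```python
from statistics import mean


def best_k(log_likelihoods):
    """log_likelihoods: {k: [log marginal likelihood per run]}.

    Returns (k with the highest mean, {k: mean}).
    """
    averages = {k: mean(values) for k, values in log_likelihoods.items()}
    return max(averages, key=averages.get), averages


# Toy values, not experimental results.
sweep = {
    2: [-10.0, -12.0, -11.0],
    3: [-8.0, -9.0, -7.0],
    4: [-9.5, -9.0, -10.0],
}
chosen_k, averages = best_k(sweep)
```

Averaging over several runs reduces the influence of a single poor random initialization of EM on the choice of k.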
Latent Dirichlet Allocation
The output from discovering topic components using LDA is included in Tables 1 and 2. Table 1 contains only unigrams, and its tokens appear to be more difficult to disambiguate. Table 2 contains n-gram tokens, which appear to be more distinctive. Consequently, linkage was established from the tokens in Table 2 only.

Topic  Most Likely Topic Components
0      news, public, story, case, government, law, copyright, information, state, report, source, internet, bill, war, american, security, issue, patent, act, rights
1      blog, community, great, people, talk, conference, world, show, team, book, blogging, today, week, post, event, social, read, interesting, day, ll
2      microsoft, windows, software, server, support, vista, applications, system, open, user, version, security, release, office, access, application, update, users, desktop, development
3      people, don, make, work, time, things, ve, good, lot, thing, point, doesn, problem, isn, idea, making, long, ll, find, makes
4      video, music, time, apple, read, power, email, real, comments, future, part, phone, life, game, xbox, tv, office, device, digital, mobile
5      company, business, companies, online, year, internet, service, market, million, media, content, technology, customers, network, based, services, money, industry, youtube, deal
6      article, de, link, full, read, la, city, en, research, march, food, world, university, science, study, found, originally, high, health, starbucks
7      day, time, back, ve, good, didn, year, life, home, years, night, week, love, don, ll, days, family, hours, great, fun
8      code, net, data, file, system, application, set, class, web, type, xml, add, test, create, object, project, sql, java, asp, control
9      google, web, site, search, http, blog, post, page, content, www, users, yahoo, email, sites, link, tags, news, list, links, video
Table 1: Unigram Results for 500 Iterations of LDA

Topic  Most Likely Topic Components
0      united states, memeorandum permalink, cory doctorow, intellectual property, xeni jardin, john edwards, supreme court, global warming, york times, boing boing, al gore, black hat, prime minister, hillary clinton, barack obama, walter reed, vice president, dick cheney, los angeles, united nations
1      social media, robert scoble, technorati tags, san francisco, creative commons, bill gates, stay tuned, silicon valley, doc searls, jason calacanis, dan farber, mike arrington, gillmor gang, hugh macleod, press releases, san jose, dana gardner, dave winer, irina slutsky, jeff jarvis
2      windows vista, open source, operating system, windows server, visual studio, windows xp, sql server, windows mobile, technorati tags, operating systems, virtual pc, red hat, windows home, daylight saving, windows sharepoint, sun microsystems, daylight savings, virtual machine, steve ballmer, virtual earth
3      open source, technorati tags, vast majority, creative commons, highly recommend, chris pirillo, cell phones, highly recommended, vice versa, eric mack, social software, stay tuned, walled gardens, dave winer, originally posted, holy grail, david allen, open sourced, human beings, blue flavor
4      documentary series, comments bold moves, platinum system packs, office depot featured gadget bring games, media center, windows media center, cell phone, steve jobs, windows mobile, windows media, cell phones, windows media player, las vegas, media player, norton ghost, sony ericsson, alexander grundner, dates sorted, windows ce
5      social networking, press release, san francisco, vice president, credit card, social networks, silicon valley, wall street journal, social network, york times, super bowl, san jose, venture capital, fourth quarter, wall street, search engine, los angeles, palo alto, seeking alpha, san diego
6      originally posted, accountancy age, david pescovitz, tom sanders, global warming, jim gray, howard schultz, shaun nichols, san francisco, de la, hong kong, breast cancer, spinal cord, srem regsvr, de software, south africa, salt lake, mechanical turk, breast augmentation, carbon dioxide
7      tom harpel, mark frauenfelder, san francisco, originally uploaded, star wars, los angeles, york city, technorati tag, star trek, guinea pig, cory doctorow, ice cream, hong kong, north carolina, britney spears, battlestar galactica, jake snider, mardi gras, jon stewart, super bowl
8      sql server, visual studio, technorati tags, compact edition, visual basic, windows workflow, scott hanselman, windows forms, david hayden, windows presentation, stored procedures, windows communication, stored procedure, starter kit, scott guthrie, lambda expressions, windows powershell, scott watermasysk, jeff atwood, heavy lifting
9      search engine, windows live, search engines, rss feed, techmeme permalink, rss feeds, search results, movable type, social network, social networking, div xmlns, instant messaging, crunch network, rss reader, personalized homepage, social bookmarking, steve rubel, search box, rss bandit, social networks
Table 2: N-gram Results for 500 Iterations of LDA
Possible future work includes determining a way to sweep the number of topics in LDA, or possibly using the k from EM that represents the optimal number of clusters. This may produce fewer links that could be considered false positives. Modifications to the MALLET toolkit or a new implementation will be required in order to accomplish this. Another potentially interesting future experiment would be to find a way to fold in new data. Additionally, selecting a time horizon for topics may be of interest. Temporal representations of affinity networks and their evolution could reveal patterns in blogger behavior. Finally, a better representation of linkage could be established, perhaps one that is asymmetric between bloggers. In this paradigm, a blogger who comments frequently on a few topics will strongly link to those who comment frequently on the same topics. However, the linked bloggers may have more varied interests and comment on more topics, which may reduce the returning link from that blogger.
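One possible reading of such a directed link is sketched here. The formula is an assumption made purely for illustration and is not a method evaluated in this paper: the strength from blogger a to blogger b is taken to be the fraction of a's entries that fall in topics b has also written about. The blogger names and counts are invented.

```python
from collections import Counter


def directed_strength(topic_counts, a, b):
    """Fraction of a's entries that fall in topics b has also written about.

    topic_counts: {blogger: Counter mapping topic -> number of entries}.
    """
    mine, theirs = topic_counts[a], topic_counts[b]
    total = sum(mine.values())
    if total == 0:
        return 0.0
    shared = sum(n for topic, n in mine.items() if theirs.get(topic, 0) > 0)
    return shared / total


# Invented counts: one focused blogger and one with more varied interests.
topic_counts = {
    "focused": Counter({"t1": 6, "t2": 4}),
    "varied": Counter({"t1": 2, "t2": 2, "t3": 8, "t4": 8}),
}
forward = directed_strength(topic_counts, "focused", "varied")
backward = directed_strength(topic_counts, "varied", "focused")
```

Under this definition the focused blogger links strongly to the varied one, while the returning link is weaker because most of the varied blogger's entries concern other topics.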
I would like to thank the members of the Brigham Young University Data Mining Lab who provided the blog data set used in this paper, especially Matthew Smith and Brock Judkins, who were instrumental in gathering and preparing this data set. Additionally, the code from the Brigham Young University Natural Language Processing Lab was a helpful base for preparing the EM algorithm and the outputted document vector representations. Portions of that code base are due to the Natural Language Processing Lab at Stanford.
References
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
Andrew Kachites McCallum. http://mallet.cs.umass.edu, 2002.
M. Meila and D. Heckerman. An experimental comparison of several clustering and initialization methods, 1998.