Brown University
Cheap Guanxi Identification in the Chinese Web
CSCI 2950U Course Project
Weiyi Liu CS Department
September, 2011
Background
Researchers have noticed that Chinese web culture may differ from that of the West. Hyperlinks between Chinese websites may reflect a complex dyadic social construct in China known as “guanxi” [1]. Although “guanxi” translates directly as “relationship”, the concept as used and applied in Chinese culture is much richer. We adopt the following definition of guanxi: an informal, particularistic personal connection between two individuals who are bounded by an implicit psychological contract to follow the social norm of guanxi such as maintaining a long term relationship, mutual commitment, loyalty, and obligation [2]. A quality guanxi is characterized by the mutual trust and feeling developed between the two parties through numerous interactions following the self-disclosure, dynamic reciprocity, and long-term equity principles. A cheap guanxi is established for the sole purpose of sharing resources or profit; in other words, it is profit-oriented. Once one side of a cheap guanxi can no longer afford the profit reciprocity, the guanxi collapses.
Problem Statement
Applied to the web, mutual links that are established solely to promote one’s own website, whether by increasing traffic or PageRank, may be a form of cheap guanxi. In other words, two websites that establish a cheap guanxi place no requirements on each other’s content or purpose.¹ The large-scale existence of cheap guanxi makes link analysis algorithms such as PageRank less effective, because they may overstate the importance of websites that trade links as resources with other websites. Our goal is to identify cheap guanxi websites in the Chinese web by studying the correlation between link patterns and PageRank. It is worth mentioning that this phenomenon is not unique to China. Mutual link trading websites exist in other languages as well, such as Mutual Link Market (http://www.mutuallinkmarket.com). We select Chinese websites as our research subject because mutual links based on cheap guanxi are more evident in China, according to the results presented in [2]. Furthermore, a Chinese website typically has a dedicated section at the bottom of its homepage, called “Friendly Links” or “Collaboration Links”, which can be used to check the plausibility of our results. We also believe that studying a culture in which dyadic relationships play a critical role will give us a better understanding of social networks.
¹ For example, a travel agency web page carries the following note: “Our website has a PageRank score of 4 ... we welcome friendly links. There are no requirements on content or purposes to link to our page; however, the web site has to be legal, with no viruses ... Each web site’s PageRank score has to be ≥ 4.”
Assumptions
Our work is based on the assumption that websites establish cheap guanxi in order to gain PageRank or traffic. By Theorem 1 of [3], given website i and website j with PageRank P(i) and P(j), the difference P(j) - P(i) is correlated with the amount of PageRank that website i will gain or lose if it establishes a mutual link with website j. Websites therefore generally prefer to establish cheap guanxi with websites of higher PageRank. However, since neither website is willing to lose PageRank by establishing a mutual link, we speculate that cheap guanxi websites tend to have PageRank scores similar to those of their partners. Based on these assumptions, prior work on Chinese websites has proposed empirical thresholds under which a website is identified as a cheap guanxi website if it has at least one of the following features:
1. More than 30% of its outgoing links are reciprocated [2].
2. More than 50% of its neighboring websites are connected to it by mutual links with a PageRank difference of one or zero [2].
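The two-feature rule above can be written as a small predicate. The following is only a minimal sketch: the `SiteStats` record, its field names, and the function `is_cheap_guanxi` are hypothetical names introduced here for illustration, and we assume that PageRank is available as an integer-valued score so that a "difference of one or zero" is well defined.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    """Per-website link counts (hypothetical record, not from the cited studies)."""
    out_links: int      # distinct websites this website links to
    reciprocated: int   # how many of those websites link back
    neighbors: int      # distinct neighboring websites
    close_mutual: int   # neighbors joined by a mutual link with PageRank difference <= 1


def is_cheap_guanxi(s, recip_threshold=0.30, close_threshold=0.50):
    """Return True if the website shows at least one of the two features."""
    # Guard against division by zero for websites without links.
    recip_ratio = s.reciprocated / s.out_links if s.out_links else 0.0
    close_ratio = s.close_mutual / s.neighbors if s.neighbors else 0.0
    # Both thresholds are strict ("more than"), as in the list above.
    return recip_ratio > recip_threshold or close_ratio > close_threshold
```

Keeping the thresholds as parameters makes it straightforward to tune them later if the defaults turn out to be inappropriate for a given dataset.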
Proposal
Our project has two critical parts: calculating PageRank and counting mutual links. Outgoing links without reciprocation will also be counted, although this is trivial compared with the two critical parts. Because a website may send a link from one page and receive a link back on another page, we consider mutual links between websites rather than merely mutual links between web pages. To make the whole process scalable, we will implement our work on Hadoop. Since PageRank on MapReduce is already well developed, we will focus on counting mutual links efficiently with MapReduce. For now, we plan to run these two jobs separately. We hope to find a way to combine them into one job once we better understand the MapReduce PageRank algorithm. After these two parts, we will compute several statistics, such as the percentage of a website’s outgoing links that are reciprocated and the PageRank difference between two mutually linked websites. We will then apply the empirical thresholds described in the Assumptions section to identify cheap guanxi websites.
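To illustrate the mutual link counting step, here is a minimal single-machine sketch written in map and reduce style. It is not the Hadoop job itself: the function names are hypothetical, the shuffle phase is simulated with a dictionary, and a website is assumed to be identified simply by the lowercased host name of its URLs (so `www.a.cn` and `a.cn` would count as different websites).

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_of(url):
    """Collapse a page URL to its website; assumed here to be the host name."""
    return urlparse(url).netloc.lower()


def map_link(src_url, dst_url):
    """Map step: emit one record per inter-site link, keyed by the unordered site pair."""
    a, b = site_of(src_url), site_of(dst_url)
    if a and b and a != b:
        key = (a, b) if a < b else (b, a)
        yield key, a  # the value records which side the link originates from


def reduce_pair(key, sources):
    """Reduce step: the pair is mutual if links were seen from both sides."""
    return key, set(sources) == set(key)


def reciprocation_ratio(page_links):
    """Fraction of each website's outgoing site-level links that are reciprocated."""
    grouped = defaultdict(list)  # stands in for the MapReduce shuffle phase
    for src, dst in page_links:
        for key, value in map_link(src, dst):
            grouped[key].append(value)
    out_total = defaultdict(int)
    out_mutual = defaultdict(int)
    for key, sources in grouped.items():
        _, mutual = reduce_pair(key, sources)
        for s in set(sources):  # each linking side counts the pair once
            out_total[s] += 1
            out_mutual[s] += int(mutual)
    return {s: out_mutual[s] / out_total[s] for s in out_total}
```

Keying each record by the unordered pair of websites means that all links between two websites, whichever pages they appear on, reach the same reducer, which is what allows a link sent from one page and returned to another to be recognized as mutual.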
Hypotheses & Experiments
We expect the identification results to be plausible. If they are not, one likely reason is that the empirical thresholds are inappropriate, and tuning them might give better results. Other possible reasons are that the dataset is not of sufficient quality or generality, or that the given assumptions are incorrect. By analyzing the results, we should be able to find a reasonable explanation. To confirm the hypothesis, we will examine the results in the following ways:
1. As mentioned in the Problem Statement, a typical Chinese cheap guanxi website has a dedicated section on one of its web pages called “Friendly Links” or “Collaboration Partner Links”. By matching these textual hints against our results, we can check whether the results are plausible.
2. Since a website that establishes one cheap guanxi is highly likely to take part in other, similar cheap guanxi, we expect the websites identified as cheap guanxi websites to form small clusters according to their content, user groups, locations, etc.
3. If such clusters exist, we expect the PageRank of the websites in a cluster to be similar or to vary smoothly.
Since it may be impossible to prove, either mathematically or logically, that the identification made with a given threshold is correct, human judgment will play a critical role in examining the plausibility of the results.
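Checks 2 and 3 above could be approximated as follows. This is only one possible reading, offered as a sketch: it assumes that a "cluster" is a connected component of the mutual-link graph restricted to the identified websites, which is our simplification and ignores content, user groups, and location. The function names and the inputs (`mutual_pairs`, `flagged`, `pagerank`) are hypothetical.

```python
from collections import defaultdict


def clusters(mutual_pairs, flagged):
    """Connected components of the mutual-link graph among flagged websites."""
    adj = defaultdict(set)
    for a, b in mutual_pairs:
        if a in flagged and b in flagged:
            adj[a].add(b)
            adj[b].add(a)
    seen, result = set(), []
    for start in sorted(flagged):  # sorted for a deterministic cluster order
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:  # iterative depth-first search
            node = stack.pop()
            comp.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        result.append(comp)
    return result


def pagerank_spread(cluster, pagerank):
    """Difference between the highest and lowest PageRank in a cluster."""
    scores = [pagerank[s] for s in cluster]
    return max(scores) - min(scores)
```

A small spread within each cluster would be consistent with check 3, while many single-website clusters would count against check 2. Either outcome still requires human inspection of the websites involved.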
References
[1] V. King, L. L. Yu, and Y. Zhuang, “Guanxi in the Chinese web - a study of mutual linking,” in Proceedings of the 17th International Conference on World Wide Web, 2008, pp. 1161–1162.
[2] V. King, L. Yu, and Y. Zhuang, “Guanxi in the Chinese Web,” 2009 International Conference on Computational Science and Engineering, pp. 9–17, 2009.
[3] K. Avrachenkov and N. Litvak, “The Effect of New Links on Google Pagerank,” Stochastic Models, vol. 22, no. 2, pp. 319–331, Jul. 2006.