远见与创新: 2005.12

2005-12-25

Collaborative Tagging

The Structure of Collaborative Tagging Systems
Scott A. Golder and Bernardo A. Huberman, HP

Collaboration is an interesting topic. BT is an excellent example. Now del.icio.us is another.

Tagging & Taxonomy is two different concepts. The difference is: Tagging is non-hierarchical and inclusive, so flexible, easy.

Flexibility brings possible challenges because of the characteristic of language:

Polysemy: a word can have many senses. e.g. "window" can be a hole in wall, or the pane of glass
Synonymy: multiple words share one meaning, e.g. television or tv
Basic level variation: the words descripting an item can vary from general to very specific, e.g. cat, animal

User activity is interesting, we can track it to find his/her interesting and find friends.

Section 4.2 shows, the tags for a URL can reach a random stable proportion after about 100 bookmarks. In figure 7a,b. the number of finally stable tags for a URL is about 10. the biggest tag proportion is 20% and 40%. The stability is reached at 50 bookmarks and maintains until 400 bookmarks. This just shows, a URL's tags don't have dramatic changes. 50 users' tagging can reach an agreement that what is this URL.

This pattern is explained by Eggenberger and Polya model. The remarkable property of this model is, the RANDOM stable point. From this analog, just like the read and black are same attributes, the 10 tags also indicate the property of this URL at same extent, although their final stable proportion are different value. We should treat these 10 tags as same importance.

The authors' analysis about imitation and shared knowledge is correct.

There are also two trivial problems in this paper:

The authors think user use del.icio.us primarily for their own benefit, but constitute a useful public good. In fact, many users publish their bookmarks just like their publish their blog. We all know most of Blog is written for others to read, rather than the private diary. Just like Blog is a place to show personal interest, idea, finding, and stimulate the communication among people, del.icio.us is also such a system. The name of "collaborative tagging system" has uncover this point, i.e. collaboration is the key. At some extent, it is more like Digg. In Digg, users votes for the best interesting topic.
The authors think figure 4a,b shows a retrospective sensemaking process. But in the figure, when tage 3 increases continuously, the tag2 doesn't drop. So tag3 means a new tag that user want to maintain, rather than he want to refine the tag2.

[Thought]

Chinese language may be good at Tagging. Each chinese word has its context. Is there any research about it?
Need have a look of the Gaming theory for cooperation gaming. It is interesting!

[Future Reading]

Cloudalicious. http://cloudalicio.us/
Connotea. http://www.connotea.org/
CiteULike. http://www.citeulike.org/
Wu, F. & Huberman, B. (2005). Social Structure and Opinion Formation. http://www.hpl.hp.com/research/idl/papers/opinions/opinions.pdf

Technorati : Collaborative Tagging, del.icio.us

标签：未分类

Common API for Structured P2P

Towards a Common API for Structured Peer-to-Peer Overlays
Frank Dabek, MIT, Ben Zhao, John Kubiatowicz, Ion Stoica, Berkeley, Peter Druschel, Rice

Three typical service of KBR API ( Key-base Routing API)

DHT: distributed hash tables - put, remove, get
DOLR: decentralized object location and routing - publish, unpublish, sendToObj
CAST: group anycast and multicast - join, leave, multicast, anycast

Both DOLR and CAST maintain sets of endpoints in a decentralized manner and by their proximity in the network, using a tree consisting of the routes from the endpoints to a common root associated with the set. An integral part of DOLR process is the
maintenance of the distributed directory during changes to the underlying nodes or links. For CAST, Overlay nodes may join and leave a group, multicast messages to the group, or anycast a message to a member of the group.

The API defined:

Routing
- void route(key -> K, msg -> M, nodehandle -> hint)
- void forward(key <-> K, msg <-> M, nodehandle <-> nextHopNode)
- void deliver(key -> K, msg -> M)
Routing State Access
- nodehandle[] local_lookup(key -> K, int -> num, boolean -> safe): find next hops
- nodehandle[] neighborSet(int -> num): find neighbor
- nodehandle[] replicaSet(key -> K, int -> max_rank): find replica node
- update(nodehandle -> n, boolean -> joined): update for node join/leave
- boolean range(nodehandle -> N, rank -> r, key <-> lkey, key <-> rkey ): find node's serving key range

[Feeling]

Route is the essential key API for application.
Other API can be used for data maintainence or replicate.
If it is followed by current P2P practise?

Technorati : P2P API

标签：未分类

Search in DHT-based P2P

Complex Queries in DHT-based Peer-to-Peer Networks
Matthew Harren, etc. Ion Stoica Berkeley

Search is different from Querying. Search is a limited form of querying. Search is finding files whose names contain a given string, but Query allows combinations and correlations among things found. e.g. search music by "Bach", but query by "bach's chorales". This paper is about Query.

This paper shows Poor Query language is a serious limitations in P2P network. It thinks P2P's Poor scaling limition has been largely solved by DHT: DHT is extremely scalable, i.e. lookup can be resolved in log n ( or n exp(a) for small a).

For search, DHT has no problem. Hash indexes can be used directly for textual similarity searched. e.g. For example, a file with ID I and filename "Beethovens 9th" could be split into twelve trigrams: Bee, eet, eth, tho, hov, ove, ven, ens, ns%, s%9, %9t, 9th (where '%' represents the space character). For each such n-gram gi, the pair (gi; I) is inserted into the hash index, keyed by gi. One can build an index over n-grams for various values of n; it is typical to use a mixture of bigrams and trigrams. Then strict substring search or regual expression search can be done based on it.

The search can be expressed by SQL:

SELECT H.fileID, H.fileName
FROM hashtable H
WHERE H.text IN (<list-of-n-grams-in-search>)
GROUP BY H.fileID
HAVING COUNT(*) >= <#-of-n-grams-in-search>
AND H.fileName LIKE <substring expression>

So it thinks, the support of tranditional relational DB operation is necessary in P2P infrastrucutre. In P2P, the file/service attributes are distributed, the peers need cooperate to finish a query request.

So it introduce a three-tier software node implementation architecture for P2P Query. It also argue to implement an hierarchical table-related name space on DHT flat ID space, and use multicast routing to facilitate the query interaction between peers.

The planed supported DB operations are: selection, projection, join, grouping and
aggregation, and sorting. It also mentioned the characteristic of online query.

[Feeling]

It is more related to research about query of DB system. Knowledge is lack for me to get more insight for it. Anyway, the method for Hash search has told me search in DHT is no problem. That's great!

[Future Reading]

Haas, P. J., and Hellerstein, J. M. Online Query Processing: A Tutorial. In Proc. ACM-
SIGMOD International Conference on Management of Data (Santa Barbara, May 2001). Notes posted online at http://control.cs.berkeley.edu.
Hellerstein, J. M., Haas, P. J., and Wang, H. J. Online Aggregation. In Proc. ACM SIGMOD International Conference on Management of Data (1997).
Witten, I. H., Moffat, A., and Bell, T. C. Man-aging Gigabytes: Compressing and Indexing Documents and Images, second ed. Morgan Kaufmann, 1999.

Technorati : DHT, P2P, Search

标签：未分类

Zipf, Power-laws, Pareto

Zipf, Power-laws, and Pareto - a ranking tutorial
Lada A. Adamic, HP lab

They are all about small occurrences are common, whereas large instance are rare.

Zipf's law: y ~ r^(-b), y: size, r: rank
word frequency: 1932; city sizes: 1949

Pareto's law: P[X > x] ~ x^(-k)
Income distribution, 1896

power law distribution: P[X = x] ~ x^(-(k+1)) = x^(-a)
a = 1+k (where k is the Pareto distribution shape parameter).

Long-tailer is another words for them. And they shall be a line appeared on a log-log plot.

Zipf's law and the Internet
Lada A. Adamic, Bernardo A. Huberman, HP

Internet is Zipf's law. It differs from classic Erdös-Rényi random graphs model that has a Poisson node degree distribution.

The important attribute of Zipf's law is: It is Scale-free, i.e. elements that are much larger than the mean still have finite probability. On the other hand, in Gaussian distribution, e.g. human heights, they are extremely unlikely.

Huberman intuitive growth model explain the Zipf's law in Internet in 1999 with three assumptions:

Proportional page number growth Det(x) = K*x in each site

Page number growth rates are different among different sites

Exponential sites number growth

With these assumptions, Albert developed a new random graph growth model in 2002.

Ripeanu found P2P network E2E connections distribtution shifts from Zipf to two-sided Zipf in 2002. With this oberservation, Adamic shown in 2001, broadcast method need to be evolved to routing query to high degree nodes to obtain congestion relief and quick response.

[Questions]

Power-law network is a natural selection, but pure P2P is pursuiting a real even degree distribution. Is P2P fighting with natural rule?

How Broadcast method is proved to need to be evolved by Adamic?

[Future Reading]

Huberman, B. and Adamic, L. (1999). Growth Dynamics of the World Wide Web. Nature 401, 131.

Huberman, B. A. (2001). The Laws of the Web. The MIT Press.

Erdös, P. and Rényi, A. (1960). On the Evolution of Random Graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences 5, 17-61.

Albert R. and Barabasi A.-L. (2002). Statistical mechanics of complex networks. Review of Modern Physics 74, 47-94.

Adamic, L.A., Lukose, R.M., Puniyani, A.R., and Huberman, B.A. (2001). Search in
Power-Law Networks. Physical Review E 64: 046135.

Technorati : Pareto, Power-laws, Zipf

标签：未分类

BitTorrent Analysis

Modeling and Performance Analysis of BitTorrent Like PeertoPeer Networks
Dongyu Qiu and R. Srikant, UIUC

Fluid model:
- Downloader: dx/dt = arrive rate - abandon rate - min {total download rate, total upload rate}
- Seed: dy/dt = min {total download rate, total upload rate}- seed leave rate
- By Little's law: T = 1/(abandon rate + 1/(max {1/download rate, (1/upload rate - 1/seed leave rate)/share efficiency }
- By eigenvalues of maxtrix: system is stable
- By stochastic Ornstein-Uhlenbeck process, analyze the variance of x and y
Peer selection:
- Assumption: selfish agent, non-cooperative game.
- Game theory: Nash equilibrium point exist if upload rate = download rate for each peer
Sharing Effectiveness
- Assumption: N pieces distributed uniformly.
- 1 - P{ no piece is required by others } = 1 - C(k)/N, k: one peer's connection number. In BT, up to 40. C(k) = 1 + 1/2exp(k) + .... + 1/Nexp(k)
- BT has good sharing efficiency.

[My Opinion]

The result is not so exciting. Anyway, it is just analysis. P2P is more an engineering problem. In Internet area, there is still lack of powerful analysis mathematic methods. TCP protocol design is an example, now P2P faces same problem.
Why use the non-cooperative game to analysis the peer selection. The core of BT is cooperation. It is wrong. Maybe it is because no other theory tool to anlayze this problem. So it is unsurprising that no meaningful result is obtained.
Fluid model is simple and easy to understand. Other methods includes model as a closed queueing system (D. Towsley), Markov chain (G. de Veciana).

[Future Reading]

M. Ripeanu, I. Foster, and A. Iamnitchi. Mapping the gnutella network: Properties of large-scale peer-to-peer systems and implications for system design.
IEEE Internet Computing Journal, 6(1), 2002.
T. S. Eugene Ng, Y.-H. Chu, S. G. Rao, K. Sripanidkulchai, and Hui Zhang.
Measurement-Based Optimization Techniques for Bandwidth-Demanding Peer-To-Peer Systems. In Proceedings of IEEE INFOCOM, 2003.

Technorati : BitTorrent, P2P

标签：未分类

BitTorrent Analysis

Modeling and Performance Analysis of BitTorrent Like PeertoPeer Networks
Dongyu Qiu and R. Srikant, UIUC

Fluid model:
- Downloader: dx/dt = arrive rate - abandon rate - min {total download rate, total upload rate}
- Seed: dy/dt = min {total download rate, total upload rate}- seed leave rate
- By Little's law: T = 1/(abandon rate + 1/(max {1/download rate, (1/upload rate - 1/seed leave rate)/share efficiency }
- By eigenvalues of maxtrix: system is stable
- By stochastic Ornstein-Uhlenbeck process, analyze the variance of x and y
Peer selection:
- Assumption: selfish agent, non-cooperative game.
- Game theory: Nash equilibrium point exist if upload rate = download rate for each peer
Sharing Effectiveness
- Assumption: N pieces distributed uniformly.
- 1 - P{ no piece is required by others } = 1 - C(k)/N, k: one peer's connection number. In BT, up to 40. C(k) = 1 + 1/2exp(k) + .... + 1/Nexp(k)
- BT has good sharing efficiency.

[My Opinion]

The result is not so exciting. Anyway, it is just analysis. P2P is more an engineering problem. In Internet area, there is still lack of powerful analysis mathematic methods. TCP protocol design is an example, now P2P faces same problem.
Why use the non-cooperative game to analysis the peer selection. The core of BT is cooperation. It is wrong. Maybe it is because no other theory tool to anlayze this problem. So it is unsurprising that no meaningful result is obtained.
Fluid model is simple and easy to understand. Other methods includes model as a closed queueing system (D. Towsley), Markov chain (G. de Veciana).

[Future Reading]

M. Ripeanu, I. Foster, and A. Iamnitchi. Mapping the gnutella network: Properties of large-scale peer-to-peer systems and implications for system design.
IEEE Internet Computing Journal, 6(1), 2002.
T. S. Eugene Ng, Y.-H. Chu, S. G. Rao, K. Sripanidkulchai, and Hui Zhang.
Measurement-Based Optimization Techniques for Bandwidth-Demanding Peer-To-Peer Systems. In Proceedings of IEEE INFOCOM, 2003.

Technorati : BitTorrent, P2P

标签：未分类

2005-12-11

BitTorrent

Incentives Build Robustness in BitTorrent
Bram Cohen

Peer Distribution

[Content]

Tracker return the interaction peer list to a peer. Random graphs algorithm is used to get good robustness, i.e. avoid the power law graph's unbalance.

Peers info each other the pieces it had obtained. Then Piece Selection.

[Questions]

How random is the Random Graphs algorithm? Does it take the proximity or anything else into consideration?

How many peers in the Tracker's return list?

Is the peer list no change after ONE Tracker-Peer interaction? Can Peer ask another list? Which situation Peer shall reask?

Piece Selection

[Content]

The basic principle is: Each one do for the optimization of the total system, e.g. Rarest First, Random First Piece. So, BT is intrincically a Cooperation system:

Peer only download one piece at one time, but it shall request sub-pieces of this piece from multiple peers.

[Questions]

Because the cooperation is limited into the peer lists returned from the tracker, so the list from tracker is Key? Anyone analyze and try to optimize the tracker algorithm?

Choking Algorithm

[Content]

Unchoke for 4 peers with quickest download rate. Rate is measured at 10s.

1 Optimistic unchoke for possible quicker download peer. Rotated at 30s. In the 30s, the business can be reached in both directions.

It seems it is based on an assumption that up and down link between peers are symmetric or near-symmetric, so the quickest download of one peers also means quickest download for the other peer, so peer shall select each other. The deal is reached. The precondition is:

- Peers have different pieces, so they need each other. So Piece Selection is important. Rarest first is key here.

4 thread make TCP saturate. 4 is also the net-ants default threads, which seems not to be occassional.

[Questions]

If the pair is not equal? e.g.

- link is not symmetic, i.e. uplink rate is lower than downlink rate.
- one peer has lower bandwidth than the other,

It is predicable that the low-capability peer shall be unchoked for high-capability peers. It is understandable, and there are methods in BT, e.g.

- The Random First Piece make the low-capablity peer has a rarest piece.
- The Seed shall prefer peers with noone else upload, so at least, Seed shall not abandon the low-capablility peers.

But is there any analysis for this issue?

[Futher Reading]

bitconjurer.org

Technorati : BitTorrent, P2P

标签：未分类

Gnutella

Peer-to-Peer Architecture Case Study: Gnutella Network
Matei Ripeanu, U Chicago

[Content]

Crawl and analyze Gnutella network from 10/2000 - 06/2001.

For save time, use Client/Server crawler. Single client need 50hr for 4000 nodes.

Power-law is a general distribution in natural, includes HTTP host connection, molecules in a cell, people in a social group, etc. N = K exp(-C)

Result:

- Power-law node-shared file number distribution
- Average node connectivity: 3.4
- Power-law node connectivity distribution in Nov. 2000, but in Mar. 2001, node with <10 connectivity is constant. >10 power-law distribution.
- Node-node shortest paths: Avg: 4.24 (Nov. 2000) -> 5.35 (May 2001), Largest: 12, > 95%: < 7 hops

[Questions]

In del.icio.us, people is connected by their bookmark. Believe the bookmark distribution is power-law. How about the shared bookmark distribution? Same questions for Technorati Tag distribution. If two people have multiple shared Bookmarks/Tag are same, we look them as friends. Its a good method to find friends. How about do a Web tool for it? i.e. Google Friends?

In Jun 2001, newer version of Gnutella make its overhead traffic cut from 64% to 8%, that's great! How and who do it?

[Futher Reading]

http://gnutella.wego.com/

Zipf's distribution: includes incomes, word frequencies, Gnutella query

- L. Adamic, Zipf, Power-laws, and Pareto - a ranking
  tutorial, online tutorial: http://www.hpl.hp.com/shl/

DSS group's Gnutella research

- DSS group, Gnutella: To the Bandwidth Barrier and
  Beyond, available at: http://dss.clip2.com, Nov. 6, 2000.
- DSS group, Bandwidth Barriers to Gnutella Network
  Scalability, available at: http://dss.clip2.com, Sept. 8, 2000.

Technorati : Gnutella, P2P, Power-law
Del.icio.us : p2p

标签：未分类

P2P for Personalized Service Infrastructure

P2P for Personalized Service Infrastructure
- my idea for P2P's research and development direction

Skype is an obvious success for P2P to replace traditional legacy communication service. In this voice service area, P2P has breaked the traditional telecom service model and make end user communicate End-to-End with Minimal Cost, i.e. the server can be avoided and the infrastructure have minimal cost but with excellent scalability.

P2P has the potential to brings same breakthrough in Internet service area. This is the direction.

It face challenges:

Firstly, there is not such a strong incentive for a end user to replace centralized Web Host service with P2P to minimize cost, he/she can find FREE host service in Internet now. Currently, most of revenue of Internet company is from advertisement, e.g. Google (4B), Shoping, e.g. Amazon, transaction intermediate fee, e.g. eBay with its Paypal. Many Web service are provided freely, e.g. Google provides Mail, Blog, Group, Search, Froogle, Calendar, Base, Map, etc. totally freely for end user.

Secondly, for a merchant who want to provides service in Internet, e.g. CISCO, DELL, for OAM convenience, they shall still reply on the centralized Internet server infrastructure to provide e-commerce service, i.e. rent a Server and related Host service, rather than use P2P network with Zero cost, because current P2P infrastructure still need hugh improvement in Performance.

But P2P still has hugh potential.

Look back at the history, we see, the biggest advantages for P2P is personal service in World. With P2P, the service provided by a personal can be reached to the height that anyone never image.

The best example is BT. Its killer point is: with it, a personel can provide Shared file service for total world with high capability and performance. That is totally different from the service provided in power-law network. It matches the services that require high network bandwidth are: big file download, video streaming, virtual reality, gaming, etc, which is currently hotest development area for P2P. We see this trend in both China and World.

P2P break the power-law network, which is required by current unstopped trend of Personality Service. Nature's trend is forming power-law network, i.e. the service and connection is provided centrally, which is economical. But on the other side, the trend of Personality is also unstopped. We has seen Blog has replaced BBS. Blog, Podcast, personlay shop on eBay is the hostest topic now.

For Personal Service, the most important tools are P2P and Web2.0.

P2P is a great tool to break power-law network. P2P give End User the capablility to communicate E2E.

Web2.0 provides a Web Service programming Enivornment. End User can access these Web service by API to provide presonalized service, e.g. Google Map API, Amazon API, etc.

Now, most of personalized service are provided in free Host Internet Service, which provides convenience and also limit the capability of Personal Service. Google has done much. We thanks for Google, but there maybe a end. Economically, if one want to deploy real service, money must be paid for the centralized web infrastructure.

The final solution is the distributed web service infrastructure. P2P play a Key role here. In this infrastructure, data and service shall saved and provided locally, the access shall be provided with P2P. Most current personal web service, e.g. Froogle, Blog, Google Base, Answer, File Sharing (Img, Video, Book, Music, Podcast, etc.), Steaming, Local, News, Mail, Talk, Group, Homepage, Event, etc. can be provided with Content Management Software locally with more convenience for content control and publish. Certainly, the precondition is the content/service provider is Always-on in network. And the non-real-time content/service can also be published and distributed in network by cooperation of Peers, then be accessed with better performance.

For reach this goal, what we need to do?

Firstly, the content/service publish tool, e.g. Open-Souce Content Management software. It is output is SWDL, RSS, Tag, etc. The output may includes:

- Service Type: Froogle, Blogger, Answer, File Sharing (Img, Video, Book, Music, Podcast, etc.), Steaming, Local, News, Mail, Talk, Group, Homepage, Event, etc.
- Service Key: Title, Name, Location, Area, Category, Owner, etc.

Secondary, the P2P infrastructure. Manage the node Join, Leave, Service publish, backout, cooperation, etc.

Thirdly, the content/service retrieve tool, e.g. a Google to find content/service according to SWDL, RSS, Tag, etc. Includes: search, category, directory, earth, maps.

Forthly, intrinsic network infrastructure application support, e.g. Paypal, analytics, adsense, adwords, etc.

The key is the P2P infrastructure. And the core of this infrastructure is Cooperation.

Current most P2P Research focus on the algorithm for following aspects:

Network establish and maintainence. e.g. node join, leave. The challenge is believed to be the Churn of the network.

Content publish and retrieve. The basic of the research is the key-based routing. The challenge is believed to be the performance under P2P Global distribution. Publish is like CDN, retrieve try to utilize Proximity.

We need note:

Search with multiple key words is essential. Search by regular expression is important if the system want to be used by end user. Let's assumpt tt can be workarounded by preprocessing to convert to multiple key words, so at least, search with multiple key words should be supported and it is essential.

User Performance is essential. It is the most key point that the system shall be used by end user. BT is an excellent example. We can obtain many exprience from it to improve the performance of P2P. This is an key research and enginneering area. Skype is not an accident. It use Super-node to improve the performance, although this method is not so clever and also is not fully distributted. CDN, Node cache, Multiple Peers Cooperation is another three methods to improve performance.

Incentives for Cooperation must be built into infrastructure. This is the valuable experience from BT.

Near-Optimization is OK. The CPU and Memory capacity of terminal and network bandwidth now can make them lower priority. This is also the reason why Kad is so emphasized by P2P developer now, and other algorithm is not.

Some feeling:

Security: The security problem shall mostly lie in application level. e.g. SPAM. The swindler shall not disappear in virtal world, but it is the task of the end user to find them. In fact, end user has enough experience in current Internet world. The transaction and lower level security is also not different from current Internet world.

Network Churn. This problem may be a little relaxed when each one has one Always-On network link, just like currently all web server should be always online to provide service. I believe the Always-On is a not-so-far goal. At least, now in China, the cost is about $15/Month by ADSL or LAN. The 3G cost is still unknown. If focus more on this problem, the history in Ad Hoc shall occurs again. In fact, the most important application in Ad Hoc now is: establish the network on site quickly, rather than deal with the network churn.

[Reference]

Paul Harrison mentioned the concept of "decentralized internet service" and some examples. The interesting PDF slides available here.

Technorati : P2P DHT
Del.icio.us : p2p

标签：未分类

远见与创新

2005-12-25

Collaborative Tagging

Common API for Structured P2P

Search in DHT-based P2P

Zipf, Power-laws, Pareto

BitTorrent Analysis

BitTorrent Analysis

2005-12-11

BitTorrent

Gnutella

P2P for Personalized Service Infrastructure

About

我的简介

Links

Previous

Archives