Attacking HITS (and not PageRank)

While I think PageRank is a very clever (though simple) idea, I’m not so sure about HITS. What are these algorithms for? For predicting the quality of a page on the Web based on the links between pages. PageRank assumes that a page linked to by many pages, and by pages of high quality (recursive!), is itself of good quality, i.e. it is an authority. HITS is based on the notions of hub and authority: a good hub is a page that points to several good authorities; a good authority is a page that is pointed to by several good hubs.
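To make the two definitions concrete, here is a sketch of the commonly quoted textbook form of both scores. The symbols are shorthand introduced here, not taken from the original papers: a(p) and h(p) are the authority and hub scores of page p, q → p means that page q links to page p, N is the number of pages, out(q) is the number of out-links of q, and d is a damping factor (often quoted as around 0.85). In HITS both score vectors are usually rescaled after each round so they stay bounded.

```latex
\begin{align*}
\text{HITS:}\quad & a(p) = \sum_{q \to p} h(q), \qquad h(p) = \sum_{p \to q} a(q) \\
\text{PageRank:}\quad & PR(p) = \frac{1-d}{N} + d \sum_{q \to p} \frac{PR(q)}{\mathrm{out}(q)}
\end{align*}
```

Note that h(p) is the only one of these sums that runs over the out-links of p itself; the others run over the pages that link to p.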
So, why do I appreciate PageRank more than HITS? Because the latter can be easily attacked. The PageRank of this page depends only on the pages linking to this page, and I cannot easily force everyone on the web to link to this page. It depends on what other pages decide to link to, and I have no power over that.
Conversely, according to HITS, the hubness of this page depends on the pages this page links to, and I have total power over the pages I link to! Do I want this page to become a hub about cars? It is enough to link to (what I think are) car authorities: bmw, mercedes, ferrari, ford, renault, … (better not fiat). Then, do I want to exploit the hubness score this page got? I would simply also link to crappyCarsISell.com. HITS thinks this page is a hub and, since a hub by definition points to authorities, HITS concludes that crappyCarsISell.com is a car authority.
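The two-step attack can be tried on a toy graph. What follows is only a minimal sketch under stated assumptions, not Kleinberg's exact procedure: `hits` is a plain iterative version of the hub/authority updates with unit-length rescaling, and the pages `directory1`, `directory2`, `directory3` and `thispage` are hypothetical names made up for the illustration.

```python
from math import sqrt

def hits(links, iterations=50):
    """Return (hub, authority) scores for a dict {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hubness is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Rescale both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical honest web: three car directories pointing at car makers.
honest = {
    "directory1": ["bmw", "mercedes", "ferrari", "ford"],
    "directory2": ["bmw", "mercedes", "renault"],
    "directory3": ["ferrari", "ford", "renault"],
}

# Step 1: this page links to the same authorities and so becomes a hub.
# Step 2: it also links to the page it wants to promote.
attacked = dict(honest)
attacked["thispage"] = ["bmw", "mercedes", "ferrari", "ford", "renault",
                        "crappyCarsISell"]

hub, auth = hits(attacked)
for page in sorted(hub, key=hub.get, reverse=True)[:3]:
    print(f"hub  {page:16s} {hub[page]:.3f}")
for page in sorted(auth, key=auth.get, reverse=True):
    print(f"auth {page:16s} {auth[page]:.3f}")
```

In this toy graph `thispage` ends up as the top hub, because it links to a superset of what any directory links to, and `crappyCarsISell` picks up a nonzero authority score although no honest page links to it.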
What matters is the direction of links! I have no control over the links that come into my page, but I have total control over the links that go out of it. Anyway, I think the work by Kleinberg is simply great, but HITS does not take into account the fact that users will always try to game systems (especially, but not only, if they get an immediate benefit).
… I almost forgot the initial reason for this post: I was reminded of HITS while reading Lexical authorities in an encyclopedic corpus: a case study with Wikipedia by my friend Francesco, whose blog I discovered just today via a comment he left here. And this means one less friend without a blog! Welcome Francesco!

13 thoughts on “Attacking HITS (and not PageRank)”

  1. Francesco

    Thank you for your warm welcome,

    regarding the main topic of your post, I think your analysis is interesting, but I somewhat disagree:

    1) It seems to me (I’m no expert, however) that “hubness” is used more or less only as a means to compute “authority”, which is often the only value used in search engine rankings. Of course you can create some fake “hub” pages to promote the authority of other pages, but this is the same strategy used to fool PageRank.

    2) In the Wikipedia experiment, we considered only authority, since the hubness results weren’t really significant or interesting (or perhaps our analysis was not accurate!)

    3) *More important:* links in web pages are “real content” (whereas, for example, a judgement about a product in a recommendation system is not a product). The consequence is that a “fake” hub… is still a good hub. To make a fake hub, you need to perform the time-consuming (and useful) task of choosing authoritative pages, and this kind of work is exactly what “hubness” rewards!

    4) hubness and authority are orthogonal: you can “fake” a hub, but not an authoritative hub

    Francesco

  2. paolo

    Not very warm actually, I almost forgot to do it! ;-)

    anyway, I still believe that
    – it is easy to become a hub (I can create a bot that does it in a few minutes: directories at dmoz are probably good hubs, so creating a local mirror makes your local mirror a good hub)
    – then I add to my local mirror (with good hubness) the link to the page whose authority I want to increase.

    Easy, and it works.
    I think HITS cannot be used on the web, maybe only on some intranet.

  4. Cai

    Hmm, I do agree with Paolo and disagree with Francesco. In fact, I think that Paolo’s point is excellent :-)

    @Francesco: PageRank is *not* as susceptible to attacks as HITS, as you have to create rank totally on your own: in other words, you have to “drill down” into the network and create enough peers that trust you (some sort of recursive back-stepping). On the other hand, for HITS, you just need two steps.

    1) Create good hubs, which is easy, you just link to good authorities and they cannot do anything against that since you can link to anything you want :-)

    2) create good authorities by having your good hubs link to your authorities.

    With PageRank, it doesn’t work, since you need to have the good authorities link to you – and they’re not going to do it if you’re a malicious rank grabber ;-) On the other hand, with HITS, you can short-circuit this security mechanism through step 1).

    Have fun boys and I’m eagerly awaiting your comments!

    BTW: Hey Paolo, what’s up with you, long time no hear, my Italian friend!

  6. Zbigniew Lukasiak

    One countermeasure would be to normalize the hubness, that is, to divide the number of links to authoritative pages by the number of all links on the page. But this would still allow creating sites with high hubness that are nonetheless meaningless, because a random list of authoritative pages is not very useful.

    Personally, I think the division into hubs and authorities is a bit artificial.

  7. gino

    Trying to understand what hubness means, I have just read this excellent post by Paolo, and I am here to ask for a better understanding.

    PageRank is a combination of a quantitative and a qualitative criterion: the qualitative component means that incoming links do not each count as one.

    HITS is again a combination of a quantitative and a qualitative criterion: the qualitative component means that incoming and outgoing links do not each count as one.
    As Francesco said, it’s easy to become a hub (and a hub is a hub, “fake hub” does not make sense), but not to become authoritative (or an authoritative hub).

    Correct? So, HITS seems stronger than PageRank: more reliable and fitting.

  8. gino

    Paolo, my objective is only a full understanding.
    Let’s focus on the way to game the system. In PageRank I need to create fake pages linking to the page I want to make artificially authoritative. In HITS I need to create fake hubs linking to the page I want to make artificially authoritative. Is this a real difference in favour of PageRank?

  9. gino

    Thank you for calling stupid those who do not understand, me included. In fact, I still do not understand.

    Remember we’re focusing on how to game the system.

    From your original post: “PageRank assumes that a page linked by many pages and linked by pages of high quality (recursive!) has a good quality”. Even if incoming links are taken into account here (so the direction of the arrow is known, dear), if I have the right skill (more or less the same required to make a fake hub), I can again take control of those links by creating pages that link to my artificially authoritative page. So I still don’t see the difference.

    To be honest, there’s one difference: in PageRank, to make a page artificially authoritative you have to build a network (pyramid) of artificially authoritative pages. But this is a job for an algorithm, so is it a relevant difference when gaming the system?

    Thank you if you’ll be patient

  10. paolo Post author

    “Stupid” was not referring to you; it is a commonly used catchphrase.
    See http://en.wikipedia.org/wiki/It%27s_the_economy,_stupid

    I don’t know how to make it clearer than what I already wrote:

    PageRank, in order to say Paolo Massa is an authority, considers only what other people say about Paolo Massa. Paolo Massa has no control over what other people say about Paolo Massa.

    HITS, in order to say Paolo Massa is a hub, considers only what Paolo Massa says about other people. Paolo Massa has total control over what Paolo Massa says about other people.

    Why? Because you have easy access to outgoing arrows but not to incoming arrows (I cannot fake the fact “Richard Stallman says ‘Paolo Massa is the best programmer in the world’”).
