Social Networks and Web 2.0 papers at WWW2009

The recently announced list of accepted papers at the WWW 2009 conference is at the end of this post. I’m particularly interested in the track “Social Networks and Web 2.0” and in the following papers:

• Ulrik Brandes, Patrick Kenis, Juergen Lerner and Denise van Raaij. Network Analysis of Collaboration Structure in Wikipedia
• Yutaka Matsuo and Hikaru Yamamoto. Community Gravity: Measuring Bidirectional Effects by Trust and Rating on Online Social Networks (mentioning the Epinions dataset, maybe the dataset I released on Trustlet)
• Shilad Sen, Jesse Vig and John Riedl. Tagommenders: Connecting Users to Items through Tags
• Jérôme Kunegis, Andreas Lommatzsch and Christian Bauckhage. The Slashdot Zoo: Mining a Social Network with Negative Edges
• Cristian Danescu Niculescu-Mizil, Gueorgi Kossinets, Jon Kleinberg and Lillian Lee. How opinions are received by online communities: A case study on Amazon.com helpfulness votes
• Meeyoung Cha, Alan Mislove and Krishna Gummadi. A Measurement-driven Analysis of Information Propagation in the Flickr Social Network

User Interfaces and Mobile Web
Maryam Kamvar, Melanie Kellar, Rajan Patel and Ya Xu. Computers and iPhones
and Mobile Phones, oh my! A logs-based comparison of search users on different
devices.
Abstract: We present a logs-based comparison of search patterns across
three platforms: computers, iPhones and conventional mobile
phones. Our goal is to understand how mobile search users differ
from computer-based search users, and we focus heavily on the
distribution and variability of tasks that users perform from each
platform. The results suggest that search usage is much more
focused for the average mobile user than for the average
computer-based user. However, search behavior on high-end
phones resembles computer-based search behavior more than
mobile search behavior. A wide variety of implications follow
from these findings. First, there is no single search interface which
is suitable for all mobile phones. We suggest that for the higher-end
phones, a close integration with the standard computer-based
interface (in terms of personalization and available feature set)
would be beneficial for the user, since these phones seem to be
treated as an extension of the users’ computer. For all other
phones, there is a huge opportunity for personalizing the search
experience for the user’s “mobile needs”, as these users are likely
to repeatedly […]
Yu Zheng, Lizhu Zhang, Xing Xie and Wei-Ying Ma. Mining Interesting Locations and
Travel Sequences from GPS Trajectories for Mobile Users
Abstract: The increasing availability of GPS-enabled devices is changing the way
people interact with the Web, and brings us a large amount of GPS trajectories
representing people’s location histories. In this paper, based on multiple users’ GPS
trajectories, we aim to mine interesting locations and classical travel sequences in a
given geospatial region. Here, interesting locations mean culturally important
places, such as Tiananmen Square in Beijing, and frequented public areas like
shopping malls and restaurants. Such information can help users understand
surrounding locations, and would enable travel recommendation. In this work, we
first model multiple individuals’ location histories with a tree-based hierarchical
graph (TBHG). Second, based on the TBHG, we propose a HITS (Hypertext Induced
Topic Search)-based inference model, which regards an individual’s access on a
location as a directed link from the user to that location. This model infers the
interest of a location by taking into account the following three factors. 1) The
interest of a location depends on not only the number of users visiting this location
but also these users’ travel experiences. 2) Users’ travel experiences and location
interests have a mutual reinforcement relationship. 3) The interest of a location and
the travel experience of a user are relative values and are region-related. Third, we
mine the classical travel sequences among locations considering the interests of
these locations and users’ travel experiences. We evaluated our system using a
large GPS dataset collected by 107 users over a period of one year in the real
world. As a result, our HITS-based inference model outperformed baseline
approaches like rank-by-count and rank-by-frequency. Meanwhile, when
considering the users’ travel experiences and location interests, we achieved
better performance than baselines such as rank-by-count and rank-by-interest.
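The mutual reinforcement between travel experience and location interest described in this abstract is structurally the same as HITS, with users as hubs and locations as authorities. A minimal sketch (the TBHG construction and region conditioning are omitted, and the visit data below is invented):

```python
# HITS-style mutual reinforcement: users act as "hubs" (travel experience),
# locations as "authorities" (interest). visits[u] is the set of locations
# user u has visited.

def rank_locations(visits, iterations=50):
    users = list(visits)
    locations = sorted({loc for locs in visits.values() for loc in locs})
    experience = {u: 1.0 for u in users}
    for _ in range(iterations):
        # A location's interest sums the experience of its visitors.
        interest = {l: sum(experience[u] for u in users if l in visits[u])
                    for l in locations}
        # A user's experience sums the interest of the locations visited.
        experience = {u: sum(interest[l] for l in visits[u]) for u in users}
        # Normalize both vectors to keep the values bounded.
        norm_i = sum(interest.values()) or 1.0
        interest = {l: v / norm_i for l, v in interest.items()}
        norm_e = sum(experience.values()) or 1.0
        experience = {u: v / norm_e for u, v in experience.items()}
    return interest, experience

visits = {"u1": {"square", "mall"}, "u2": {"square"}, "u3": {"mall", "cafe"}}
interest, experience = rank_locations(visits)
```

Locations visited by well-travelled users end up ranked above locations with the same raw visit count but less experienced visitors, which is exactly the first factor the abstract lists.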
Yuki Arase, Xing Xie, Manni Duan, Takahiro Hara and Shojiro Nishio. A Game Based
Approach to Assign Geographical Relevance to Web Images
Abstract: Geographical context is very important for images. Millions of images on
the Web have been already assigned latitude and longitude information. Due to the
rapid proliferation of such images with geographical context, it is still difficult to
effectively search and browse them, since we do not have ways to decide their
relevance. In this paper, we focus on the geographical relevance of images, which
is defined as to what extent the main objects in an image match landmarks at the
location where the image was taken. Recently, researchers have proposed to use
game based approaches to label large scale data such as Web images. However,
there is no in-depth study on the quality of collected game logs and how the logs
can improve existing applications. To answer these questions, we design and
implement a Web-based and multi-player game to collect human knowledge while
people are enjoying the game. Then we thoroughly analyze the game logs obtained
during a three-week study with 147 participants and propose methods to determine
the image geographical relevance. In addition, we conduct an experiment to
compare our methods with a commercial search engine. Experimental results show
that our methods dramatically improve image search relevance. Furthermore, we
show that we can derive geographically relevant objects and their salient portion in
images, which is valuable for a number of applications such as image location
recognition.
Joshua Hailpern, Loretta Guarino Reid, Richard Boardman and Srinivas Annam. Web
2.0: Blind to an Accessible New World
Abstract: With the advent of Web 2.0 technologies, Websites have evolved from
static pages to dynamic, interactive Web-based applications with the ability to
replicate common desktop functionality. However, for blind and visually impaired
individuals who rely upon screen readers, Web 2.0 applications force them to adapt
to an inaccessible use model. Many technologies, including WAI-ARIA, AJAX, and
improved screen reader support, are rapidly coming together. However, simply
combining them does not solve the problems of screen reader users. The main
contributions of this paper are two models of interaction for screen reader users, for
both traditional Websites and Web 2.0 applications. Further contributions are a
discussion of accessibility difficulties screen reader users encounter when
interacting with Web 2.0 applications, a user workflow design model for improving
Web 2.0 accessibility, and a set of design requirements for developers to ease the
user’s burden and increase accessibility. These models, accessibility difficulties,
and design implications are based directly on responses and lessons learned from
usability research focusing on Web 2.0 usage and screen reader users. Without the
conscious effort of Web engineers and designers, most blind and visually impaired
users will shy away from using new Web 2.0 technology in favor of desktop based
applications.
Cameron Braganza, Kim Marriott, Peter Moulder, Michael Wybrow and Tim Dwyer.
Scrolling Behaviour with Single- and Multi-column Layout
Abstract: The standard layout model used by web browsers is to lay text out in a
vertical scroll using a single column. The horizontal-scroll layout model – in which
text is laid out in columns whose height is set to that of the browser and the viewer
scrolls horizontally – seems well-suited to multi-column layout on electronic
devices. We describe a study that examines how people read and in particular the
strategies they use for scrolling with these two models when reading large textual
documents on a standard computer monitor. We compare usability of the models
and evaluate both user preferences and the effect of the model on
performance. Also interesting is the description of the browser and its user
interface used for the study.
Data Mining
David Stern, Ralf Herbrich and Thore Graepel. Large Scale Online Bayesian
Recommendations
Abstract: We present a probabilistic model for generating personalised
recommendations of items to users of a web service. The system makes use of
content information in the form of user and item meta data in combination with
collaborative filtering information from previous user behavior in order to predict
the value of an item for a user. Users and items are represented by feature vectors
which are mapped into a low-dimensional “trait space” in which similarity is
measured in terms of inner products. The model can be trained from different types
of feedback in order to learn user-item preferences. Here we present three
alternatives: direct observation of an absolute rating each user gives to some
items, observation of a binary preference (like/don’t like), and observation of a set
of ordinal ratings on a user-specific scale. Efficient inference is achieved by
approximate message passing involving a combination of Expectation Propagation
(EP) and Variational Message Passing. We also include a dynamics model which
allows an item’s popularity, a user’s taste or a user’s personal rating scale to drift
over time. By using Assumed-Density Filtering (ADF) for training, the model
requires only a single pass through the training data. This is an on-line learning
algorithm capable of incrementally taking account of new data so the system can
immediately reflect the latest user preferences. We evaluate the performance of the
algorithm on the MovieLens and Netflix data sets consisting of ~1,000,000 and
~100,000,000 ratings respectively. This demonstrates that training the model using
the on-line ADF approach yields state-of-the-art performance with the option of
improving performance further if computational resources are available by
performing multiple EP passes over the training data.
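The scoring core of this model, mapping user and item feature vectors into a shared low-dimensional trait space and measuring affinity by an inner product, is easy to sketch. The matrices below are toy values rather than learned parameters, and all of the EP/ADF training machinery is omitted:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def predict_affinity(user_features, item_features, U, V):
    # Map both feature vectors into the shared low-dimensional trait
    # space, then score the user-item pair by the inner product there.
    return dot(matvec(U, user_features), matvec(V, item_features))

U = [[0.5, -0.2, 0.1], [0.0, 0.3, 0.4]]   # 3 user features -> 2 traits
V = [[0.2, 0.1], [-0.3, 0.6]]             # 2 item features -> 2 traits
score = predict_affinity([1.0, 0.0, 1.0], [0.0, 1.0], U, V)
```

In the real system U and V are learned from ratings or preference feedback; here they only illustrate the bilinear form that makes content features and collaborative signals share one space.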
Sihong Xie, Wei Fan, Jing Peng, Olivier Verscheure and Jiangtao Ren. Latent Space
Domain Transfer between High Dimensional Overlapping Distributions
Abstract: Transferring knowledge from one domain to another is
challenging for a number of reasons. Since both the conditional
and marginal distributions of the training data and test data
are non-identical, a model trained in one domain, when
directly applied to a different domain, usually has low accuracy.
For many applications with large feature sets, such as
text documents, sequence data, medical data, or image data of
different resolutions, the two domains usually do not contain
exactly the same features, thus introducing large numbers of
“missing values” when considered over the union of
features from both domains. In other words, the marginal
distributions are at most overlapping. At the same time,
these problems are usually high dimensional, with several
thousand features. Thus, the combination of high
dimensionality and missing values makes the relationship in
conditional probabilities between the two domains hard to measure
and model. To address these challenges, we propose
a framework that first brings the marginal distributions of
two domains closer by “filling up” those missing values of
disjoint features. Afterwards, it looks for comparable
sub-structures in the “latent space” as mapped from the
expanded feature vector, where both the marginal and conditional
distributions are similar. With these sub-structures in
latent space, the proposed approach then finds common concepts
that are transferable across domains with high probability.
During prediction, unlabeled instances are treated
as “queries”, the most related labeled instances from the
out-domain are retrieved, and the classification is made by weighted
voting using the retrieved out-domain examples. We formally
show that importing feature values across domains and latent
semantic indexing can jointly make the distributions of two
related domains easier to measure than in the original feature
space, and that the nearest neighbor method employed to retrieve
related out-domain examples is bounded in error when
predicting in-domain examples.
Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang and Ji-Rong Wen. StatSnowball: a
Statistical Approach to Extracting Entity Relationships
Abstract: Traditional relation extraction methods require pre-specified relations
and relation-specific human-tagged examples. Bootstrapping systems significantly
reduce the number of training examples, but they usually apply heuristic-based
methods to combine a set of strict hard rules, which will cause limited
generalizability and thus low recall. Furthermore, existing bootstrapping methods
cannot perform open information extraction (Open IE), which can identify various
types of relations without requiring pre-specifications. In this paper, we propose a
statistical extraction framework called Statistical Snowball (StatSnowball),
which is a bootstrapping system and can perform both traditional relation extraction
and Open IE.
StatSnowball uses discriminative Markov logic networks (MLNs) and softens
hard rules by learning their weights in a maximum likelihood sense. MLN is a
general model, and can be configured to perform different levels of relation
extraction. In StatSnowball, pattern selection is performed by solving an
ℓ1-norm penalized maximum likelihood estimation, which enjoys well-founded
theory and efficient solvers. We extensively evaluate the performance of StatSnowball in
different configurations on both a small but fully labeled data set and large-scale
Web data. Empirical results show that StatSnowball can achieve a significantly
higher recall without sacrificing the high precision during iterations with a small
number of seeds; and the joint inference of MLN can improve the performance.
Finally, StatSnowball is efficient and we have developed a working entity relation
search engine called Renlifang based on it.
Guan Hu, Jingyu Zhou and Minyi Guo. A Class-Feature-Centroid Classifier For Text
Categorization
Abstract: Automated text categorization is an important technique for many web
applications, such as document indexing, document filtering, and
cataloging web resources. Many different approaches have been
proposed for the automated text categorization problem. Among them,
centroid-based approaches have the advantage of short training and testing
times due to their computational efficiency. As a result, centroid-based
classifiers have been widely used in many web applications. However, the
accuracy of centroid-based classifiers is inferior to that of SVMs, mainly
because the centroids found during the training process are far from their
optimal locations.
We design a fast Class-Feature-Centroid (CFC) classifier for
multi-class, single-label text categorization. In CFC, a centroid is
built from two important class features: inter-class term
distribution and inner-class term distribution. CFC proposes
a novel combination of these features and employs a denormalized
cosine measure to calculate the similarity between a text vector
and a centroid. Experiments on the Reuters-21578 corpus and
20-newsgroup email collection show that CFC consistently outperforms
the state-of-the-art SVM classifiers on both micro-F1 and macro-F1
scores. Particularly, CFC is more effective and robust than SVM when
the training data is sparse.
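The abstract doesn’t give CFC’s exact weighting formula, so the sketch below uses an illustrative stand-in: a centroid weight combines a term’s inner-class document frequency with a log factor that shrinks for terms spread across many classes, and scoring uses a cosine that normalizes only by the document, not the centroid:

```python
import math

def build_centroids(docs_by_class):
    """docs_by_class: {label: [set_of_terms_per_doc, ...]}.
    Each centroid weight combines inner-class spread (fraction of the
    class's documents containing the term) with inter-class exclusivity
    (a log factor that shrinks for terms appearing in many classes)."""
    n_classes = len(docs_by_class)
    classes_with_term = {}
    for label, docs in docs_by_class.items():
        for term in set().union(*docs):
            classes_with_term.setdefault(term, set()).add(label)
    centroids = {}
    for label, docs in docs_by_class.items():
        weights = {}
        for term in set().union(*docs):
            inner = sum(term in d for d in docs) / len(docs)
            inter = math.log(1 + n_classes / len(classes_with_term[term]))
            weights[term] = inner * inter
        centroids[label] = weights
    return centroids

def classify(terms, centroids):
    # Denormalized cosine: dot product with the centroid, dividing only
    # by the document's norm, so centroid magnitude still matters.
    def score(c):
        return sum(c.get(t, 0.0) for t in terms) / math.sqrt(len(terms))
    return max(centroids, key=lambda label: score(centroids[label]))

docs = {"sports": [{"ball", "team"}, {"team", "goal"}],
        "tech":   [{"code", "bug"}, {"code", "team"}]}
centroids = build_centroids(docs)
```

Terms exclusive to one class (like “ball” or “code” here) dominate the centroid, while terms shared across classes (“team”) are discounted, which is the intuition behind combining the two class-feature distributions.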
Danushka Bollegala, Yutaka Matsuo and Mitsuru Ishizuka. Measuring the Similarity
between Implicit Semantic Relations from the Web
Abstract: Measuring the similarity between semantic relations that hold among
entities is an important and necessary step in various Web-related tasks such
as relation extraction, information retrieval and analogy detection. For
example, consider the case in which a person knows a pair of entities (e.g.
Google, YouTube), between which a particular relation holds (e.g. acquisition).
The person is interested in retrieving other such pairs
with similar relations (e.g. Microsoft, Powerset).
Existing keyword-based search engines cannot be applied directly in this case
because, in keyword-based search, the goal is to retrieve documents that are
relevant to the words used in a query —
not necessarily to the relations implied by a pair of words.
We propose a relational similarity measure, using a Web search engine, to compute
the
similarity between semantic relations implied by two pairs of words.
Our method has three components:
representing the various semantic relations that exist between a pair of words using
automatically extracted lexical patterns,
clustering the extracted lexical patterns to identify the different patterns that
express a particular
semantic relation,
and measuring the similarity between semantic relations using a metric learning
approach.
We evaluate the proposed method in two tasks: classifying semantic relations
between named entities, and solving word-analogy questions.
The proposed method outperforms all baselines in a relation classification task
with a statistically significant average precision score of 0.74.
Moreover, it reduces the time taken by Latent Relational Analysis to process
374 word-analogy questions from 9 days to less than 6 hours, with a SAT score
of 51%.
Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang and Wei-Ying Ma.
Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums
Abstract: Web forums have become an important data resource for many Web
applications, but extracting structured data from unstructured Web forum pages is
still a challenging task due to both complex page layout designs and unrestricted
user created posts. In this paper, we study the problem of structured data
extraction from various Web forum sites. Our target is to find a solution as general
as possible to extract structured data such as post title, post author, post time, and
post content from any forum site. In contrast to most existing information
extraction methods which only leverage the knowledge inside an individual page,
we incorporate both page-level and site-level knowledge and employ Markov
logic networks (MLNs) to effectively integrate all useful evidence by learning
its importance automatically. The site-level knowledge includes (1) the
linkages among different object pages, such as list pages and post pages, and
(2) the interrelationships of pages belonging to the same object. The experimental results
on 20 forums show very encouraging performance of information extraction, and
demonstrate the generalization ability of the proposed approach on various forums.
We also show that the performance is limited if only page-level knowledge is used,
while by incorporating site-level knowledge both precision and recall can be
significantly improved.
Detecting The Origin Of Text Segments Efficiently
Abstract: In the origin detection problem an algorithm is given a set S of
documents, ordered by creation time, and a query document D. It needs to
output, for every consecutive sequence of k alphanumeric terms in D, the
earliest document in S in which the sequence appeared (if such a document
exists). Algorithms for the origin detection problem can, for example, be
used to detect the “origin” of text segments in D and thus to detect novel
content in D. They can also find the document from which the author of D has
copied the most (or show that D is mostly original).
We propose novel algorithms for this problem and evaluate them together with
a large number of previously published algorithms. Our results show that (1)
detecting the origin of text segments efficiently can be done with very high
accuracy even when the space used is less than 1% of the size of the
documents in S, (2) the precision degrades smoothly with the amount of
available space, and (3) various estimation techniques can be used to
increase the performance of the algorithms.
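The exact-match core of the origin detection problem can be sketched with a shingle index that maps every k-term sequence to the earliest document containing it. The paper’s contribution is doing this in a small fraction of the corpus size; this uncompressed sketch makes no attempt at that:

```python
def build_origin_index(docs, k=3):
    """docs: list of token lists, ordered by creation time.
    Maps each k-term shingle to the index of the earliest document
    in which it appeared."""
    earliest = {}
    for i, tokens in enumerate(docs):
        for j in range(len(tokens) - k + 1):
            shingle = tuple(tokens[j:j + k])
            earliest.setdefault(shingle, i)   # keep the first (earliest) doc
    return earliest

def origins(query_tokens, earliest, k=3):
    """For each k-term shingle of the query, the earliest source document
    (None if the shingle is novel)."""
    return [earliest.get(tuple(query_tokens[j:j + k]))
            for j in range(len(query_tokens) - k + 1)]

docs = [
    "the quick brown fox jumps".split(),
    "a quick brown fox runs far".split(),
]
query = "the quick brown fox runs far today".split()
result = origins(query, build_origin_index(docs))
```

Shingles mapping to None are the novel content of the query document; the most frequent non-None index is the document it copied from the most.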
Yue Lu, ChengXiang Zhai and Neel Sundaresan. Rated Aspect Summarization of Short Comments
Abstract: Web 2.0 technologies have enabled more and more people to freely
comment on different kinds of entities (e.g. sellers, products, services). The large
scale of information poses the need and challenge of automatic summarization. In
many cases, each of the user generated short comments comes with an overall
rating. In this paper, we study the problem of generating a “rated aspect
summarization” of short comments, which is a decomposed view of the overall
ratings for the major aspects so that a user could gain different perspectives
towards the target entity. We formally define the problem and decompose the
solution into three steps. We demonstrate the effectiveness of our methods by
using eBay sellers’ feedback comments. We also quantitatively evaluate each step
of our methods and study how humans agree on such a summarization task. The
proposed methods are quite general and can be used to generate rated aspect
summary given any collection of short comments each associated with an overall
rating.
Ziv Bar-Yossef and Maxim Gurevich. Estimating the ImpressionRank of Web Pages
Abstract: The ImpressionRank of a web page (or, more generally, of a web site)
is the number of times users viewed the page while browsing search
results. ImpressionRank captures the visibility of pages and sites in
search engines and is thus an important measure, which is of interest
to web site owners, competitors, market analysts, and end users.
All previous approaches to estimating the ImpressionRank of a page
have relied on privileged access to the search engine’s query log. In this
paper we present the first external algorithm for estimating the
ImpressionRank of a web page. The algorithm relies only on public data
sources: the interface of the search engine, the query suggestion service of
the search engine, and the web. In addition, the algorithm is local and uses
modest resources. It can therefore be used by almost any party to estimate
the ImpressionRank of any page on any search engine. Empirical
analysis of the algorithm on the Google and Yahoo! search engines
indicates that it is accurate and provides interesting observations on
sites and search queries.
Surajit Chaudhuri, Venkatesh Ganti and Dong Xin. Mining the Web to Facilitate Fast
and Accurate Approximate Match
Abstract: The problem of approximately matching strings has received significant
attention in the literature. Many such tasks assume the existence of reference
entity tables. In this paper, we consider the problem of determining whether a
candidate string approximately matches with a reference entity. This problem is
important for
extracting named entities such as products or locations from a reference entity
table, or matching entity entries across heterogeneous sources. Prior
approaches have relied on string-based similarity functions, which only
compare a candidate string and the entity it matches with. In this paper, we
observe that considering such evidence
across multiple documents significantly improves the accuracy of matching. We
develop efficient techniques which exploit web search engines to facilitate
approximate matching in the context of our proposed similarity functions. In an
extensive experimental evaluation, we demonstrate the accuracy and efficiency of
our techniques.
Jong Wook Kim, K. Selcuk Candan and Junichi Tatemura. Efficient Overlap and
Content Reuse Detection in Blogs and Online News Articles
Abstract: The use of blogs to track and comment on real world (political, news,
entertainment) events is growing. Similarly, as more individuals start relying on the
Web as their primary information source and as more traditional media outlets try
reaching consumers through alternative venues, the number of news sites on the
Web is also continuously increasing. Content-reuse, whether in the form of
extensive quotations
or content borrowing across media outlets, is very common in blogs and news
entries tracking the same real-world event. Knowledge about which web
entries re-use content from which others can be an effective asset when organizing
these entries for presentation. On the other hand, this knowledge is not cheap to
acquire: considering the size of the space of related web entries, it is essential that
the techniques developed for identifying re-use are fast and scalable. Furthermore,
the dynamic nature of blog and news entries necessitates incremental processing
for reuse detection. In this paper, we develop a novel qSign algorithm that
efficiently and effectively analyzes the blogosphere for quotation and reuse
identification. Experiment results show that with qSign, processing time gains
from 10X to 100X are possible while maintaining reuse detection rates of up to 90%.
Furthermore, processing time gains can be pushed multiple orders of magnitude
(from 100X to 1000X) for 70% recall.
Deepak Agarwal, Bee-Chung Chen and Pradheep Elango. Spatio-Temporal Models
for Estimating Click-through Rate
Abstract: We propose novel spatio-temporal models to estimate click-through
rates in the context of content recommendation. We track article CTR at a fixed
location over time through a dynamic Gamma-Poisson model and combine
information from correlated locations through dynamic linear regressions,
significantly improving on the per-location model. Our models adjust for user fatigue
through an exponential tilt to the first-view CTR (probability of click on first article
exposure) that is based only on user-specific repeat-exposure features. We
illustrate our approach on data obtained from a module (Today Module) published
regularly on Yahoo! Front Page and demonstrate significant improvement over
commonly used baseline methods. Large scale simulation experiments to study the
performance of our models under different scenarios provide encouraging results.
Throughout, all modeling assumptions are validated via rigorous exploratory data
analysis.
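The per-location component, a dynamic Gamma-Poisson model, amounts to a conjugate count update with exponential discounting so that old evidence decays and the click rate can drift. The prior and discount factor below are illustrative; the cross-location regressions and the fatigue tilt are omitted:

```python
def update(alpha, beta, clicks, views, discount=0.95):
    """One time-step of a dynamic Gamma-Poisson CTR tracker.
    (alpha, beta) are the pseudo-counts of a Gamma posterior over the
    click rate; discounting them lets the estimate drift over time."""
    alpha = discount * alpha + clicks
    beta = discount * beta + views
    return alpha, beta

alpha, beta = 1.0, 100.0   # illustrative prior: CTR around 1%
# A stream of (clicks, views) observations for one article at one slot.
for clicks, views in [(5, 100), (8, 100), (2, 100)]:
    alpha, beta = update(alpha, beta, clicks, views)
ctr_estimate = alpha / beta   # posterior mean of the click rate
```

Because recent counts are discounted less than old ones, a sudden change in an article’s popularity shows up in the estimate within a few time steps instead of being averaged away.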
Lei Tang, Suju Rajan and Vijay Narayanan. Large Scale Multi-Label Classification via
MetaLabeler
Abstract: The explosion of online content has made the management of such
content non-trivial. Web-related tasks such as web page categorization, news
filtering, query categorization, tag recommendation, etc. often involve the
construction of multi-label categorization systems on a large scale. Existing
multi-label classification methods either do not scale or have unsatisfactory
performance. In this work, we propose MetaLabeler to automatically determine
the relevant set of labels for each instance without intensive human
involvement or expensive cross-validation. Extensive experiments conducted on
benchmark data show that the
MetaLabeler tends to outperform existing methods. Moreover, MetaLabeler scales to
millions of multi-labeled instances and can be deployed easily. This enables us to
apply the MetaLabeler to a large scale query categorization problem in Yahoo!,
yielding a significant improvement in performance.
Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha and Yong Yu. Summarization
through Structure Learning with Diversity, Coverage and Balance
Abstract: Document summarization has played an ever more important role with
the exponential growth of documents on the web. Many supervised and
unsupervised approaches have been proposed to extract summaries from
documents. However, these approaches seldom consider summary
diversity, coverage, and balance issues which to a large extent
determine the quality of summaries. In this paper we consider
extract-based summarization with emphasis placed on three
requirements: 1) diversity in summarization which seeks to reduce
redundancy among sentences in the summary; 2) sufficient coverage
which focuses on avoiding loss of key information of the document in
the summary; and 3) balance which demands an equal amount of
information about different aspects of a document in the summary. We
formulate the extract-based summarization problem as learning a
mapping from a set of sentences of a given document to a subset of
the sentences that satisfies the above three requirements. The
mapping is learned by incorporating several constraints in a
structured learning framework to enhance diversity, coverage and
balance of the summaries. Furthermore, we explore a graph structure
of the output variable in the structure learning problem and employ
structured SVM for its solution. Experiments on the DUC2001 data
sets demonstrate significant improvement of performance in terms of
the F1 and ROUGE metrics.
Aleksandra Korolova, Krishnaram Kenthapadi, Nina Mishra and Alexandros Ntoulas.
Releasing Search Queries and Clicks Privately
Abstract: The question of how to publish an anonymized search log was brought to
the forefront by a well-intentioned, but privacy-unaware AOL search log
release. Since then a series of ad-hoc techniques have been proposed in the
literature, though none are known to be provably private. In this paper, we take a
major step towards a solution: we show how queries, clicks and their associated
perturbed counts can be published in a manner that rigorously preserves
privacy. Our algorithm is decidedly simple to state, but non-trivial to analyze. On
the opposite side of privacy is the question of whether the data we can safely
publish is of any use. Our findings offer a glimmer of hope: we demonstrate that a
non-negligible fraction of queries and clicks can indeed be safely published via a
collection of experiments on a real search log. In addition, we select an application,
finding similar queries, and show that the similar queries discovered on the original
data resemble those found on the perturbed data.
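The abstract doesn’t spell out the mechanism, but the standard recipe for this kind of release is to perturb each count with Laplace noise and publish only queries whose noisy count clears a threshold, so rare (and potentially identifying) queries are suppressed. The sketch below is that generic recipe, not the paper’s algorithm:

```python
import random

def publish_counts(counts, epsilon=1.0, threshold=20, seed=7):
    """Generic sketch of privately releasing a query-count histogram:
    add Laplace(1/epsilon) noise to every count and keep only queries
    whose perturbed count clears a threshold."""
    rng = random.Random(seed)
    released = {}
    for query, count in counts.items():
        # Difference of two iid Exp(epsilon) draws is Laplace(0, 1/epsilon).
        noisy = count + rng.expovariate(epsilon) - rng.expovariate(epsilon)
        if noisy >= threshold:
            released[query] = noisy
    return released

counts = {"weather": 5000, "rare disease name": 2, "news": 900}
released = publish_counts(counts)
```

Popular queries survive with counts close to their true values, while a query issued by only a couple of users almost never clears the threshold, which is exactly the “non-negligible fraction can be safely published” trade-off the abstract describes.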
Purnamrita Sarkar and Andrew Moore. Fast Dynamic Reranking in Large Graphs
Abstract: In this paper we consider the problem of re-ranking search results
by incorporating user feedback. We present a graph theoretic measure for
discriminating
irrelevant results from relevant results using a few labeled examples provided by
the user.
The key intuition is that nodes relatively closer (in graph topology) to the relevant
nodes
than the irrelevant nodes are more likely to be relevant.
We present a simple sampling algorithm to evaluate this measure at specific nodes
of interest, and an efficient
branch and bound algorithm to compute the top k nodes from the entire graph
under this measure.
On quantifiable prediction tasks the introduced measure outperforms other
diffusion-based proximity measures which take only the positive relevance feedback
into account. On the entity-relation graph built from the authors and papers of
the entire DBLP citation corpus (1.4 million nodes and 2.2 million edges), our
branch and bound algorithm takes about 1.5 seconds to retrieve the top 10
nodes w.r.t. this measure with 10 labeled nodes.
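The abstract doesn’t define the measure precisely, but its spirit (nodes closer in graph topology to the relevant labeled nodes than to the irrelevant ones score higher) can be approximated by sampling bounded random walks. Everything below, from the scoring rule to the toy graph, is illustrative rather than the paper’s algorithm:

```python
import random

def proximity_score(graph, start, relevant, irrelevant,
                    walks=500, max_steps=10, seed=1):
    """Sampling sketch: fraction of bounded random walks from `start`
    that hit a relevant labeled node before an irrelevant one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        node = start
        for _ in range(max_steps):
            node = rng.choice(graph[node])
            if node in relevant:
                hits += 1
                break
            if node in irrelevant:
                break
    return hits / walks

graph = {"a": ["b", "c"], "b": ["a", "r"], "c": ["a", "x"],
         "r": ["b"], "x": ["c"]}
score_b = proximity_score(graph, "b", relevant={"r"}, irrelevant={"x"})
score_c = proximity_score(graph, "c", relevant={"r"}, irrelevant={"x"})
```

Node b, which neighbors the relevant node r, scores higher than node c, which neighbors the irrelevant node x; re-ranking by this score pushes results like b above results like c after one round of user feedback.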
Fan Guo, Chao Liu, Tom Minka, Yi-Min Wang and Christos Faloutsos. Click Chain
Model in Web Search
Abstract: Given a terabyte click log, can we build an efficient and effective click
model? It is commonly believed that web search click logs are a gold mine for
search business, because they reflect users’ preference over web documents
presented by the search engine. Click models provide a principled approach to
inferring user-perceived relevance of web documents, which can be leveraged in
numerous applications in search businesses. Due to the huge volume of click data,
scalability is a must.
We present the click chain model (CCM), which is based on a solid, Bayesian
framework. It is both scalable and incremental, perfectly meeting the computational
challenges imposed by the voluminous click logs that constantly grow. We conduct
a reproducible experimental study on a data set containing 8.8 million query
sessions obtained in July 2008 from a commercial search engine. CCM consistently
outperforms two state-of-the-art competitors in a number of metrics, with over
12% better log-likelihood, more than 6% improvement in perplexity and 10%
improvement in the prediction quality of the first and the last clicked position.

Internet Monetization
Ashish Goel and Kamesh Munagala. Hybrid Keyword Search Auctions
Abstract: Search auctions have become a dominant source of revenue generation
on the Internet. Such auctions have typically used per-click bidding and pricing. We
propose the use of hybrid auctions where an advertiser can make a per-impression
as well as a per-click bid, and the auctioneer then chooses one of the two as the
pricing mechanism. We assume that the advertiser and the auctioneer both have
separate beliefs (called priors) on the click-probability of an advertisement. We first
prove that the hybrid auction is truthful, assuming that the advertisers are risk-
neutral. We then show that this auction is superior to the existing per-click auction
in multiple ways:
1) We show that risk-seeking advertisers will choose only a per-impression bid
whereas risk-averse advertisers will choose only a per-click bid, and argue that
both kinds of advertisers arise naturally. Hence, the ability to bid in a hybrid fashion
is important to account for the risk characteristics of the advertisers.
2) For obscure keywords, the auctioneer is unlikely to have a very sharp prior on
the click-probabilities. In such situations, we show that having the extra
information from the advertisers in the form of a per-impression bid can result in
significantly higher revenue.
3) An advertiser who believes that its click-probability is much higher than the
auctioneer’s estimate can use per-impression bids to correct the auctioneer’s prior
without incurring any extra cost.
4) The hybrid auction can allow the advertiser and auctioneer to implement
complex dynamic programming strategies to deal with the uncertainty in the click-
probability using the same basic auction. The per-click and per-impression bidding
schemes can only be used to implement two extreme cases of these strategies.
As Internet commerce matures, we need more sophisticated pricing models to
exploit all the information held by each of the participants. We believe that hybrid
auctions could be an important step in this direction. The hybrid auction easily
extends to multiple slots, and is also applicable to scenarios where the hybrid
bidding is per-impression and per-action (i.e. CPM and CPA), or per-click and per-
action (i.e. CPC and CPA).
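The auctioneer's choice between the two bid types boils down to comparing expected revenue per impression. A toy illustration (the numbers are invented, and the paper's model also involves the advertiser's and auctioneer's separate priors on the click-probability):

```python
def choose_pricing(cpm_bid, cpc_bid, p_click):
    """Hypothetical hybrid-auction scoring: compare a per-impression (CPM)
    bid against a per-click (CPC) bid under an estimated click probability.
    A simplification of the paper's model, for illustration only."""
    per_impression = cpm_bid / 1000.0   # CPM is quoted per 1000 impressions
    per_click = cpc_bid * p_click       # expected revenue per impression
    if per_impression >= per_click:
        return ("per-impression", per_impression)
    return ("per-click", per_click)

# CPM bid of $5, CPC bid of $0.50, estimated 2% click probability:
choice = choose_pricing(cpm_bid=5.0, cpc_bid=0.50, p_click=0.02)
```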
Gagan Aggarwal, S Muthukrishnan, David Pal and Martin Pal. General Auction
Mechanism for Search Advertising
Abstract: In sponsored search, a number of advertising slots are available on a
search results page and have to be allocated among a set of
advertisers competing to display an ad on the page. This gives rise
to a bipartite matching market that is typically cleared
by way of an automated auction. Several
auction mechanisms have been proposed, with variants of the
Generalized Second Price (GSP) being widely used in practice.
There is a rich body of work on bipartite matching
markets that builds upon the stable marriage model of Gale and Shapley
and the assignment model of Shapley and Shubik.
This line of research offers deep insights into the structure of stable
outcomes in such markets and their incentive properties.
In this paper, we model advertising auctions in terms of an
assignment model with linear utilities, extended with bidder and item
specific maximum and minimum prices. Auction
mechanisms like the commonly used GSP or the well-known Vickrey-Clarke-Groves
(VCG) can be interpreted
as simply computing a \emph{bidder-optimal
stable matching} in this model, for a suitably defined set of bidder
preferences, but our model includes much richer bidders and preferences.
We prove that in our model the existence of a stable
matching is guaranteed, and under a non-degeneracy assumption a
bidder-optimal stable matching exists as well. We give a fast algorithm to
find such a matching in polynomial time, and use it to design a truthful
mechanism that generalizes GSP, is truthful for profit-maximizing
bidders, correctly implements features like bidder-specific minimum
prices and position-specific bids, and works for rich mixtures of bidders and
preferences.
Our main technical contributions
are the existence of bidder-optimal matchings and (group)
strategyproofness of the resulting mechanism, and are proved by induction
on the progress of the matching algorithm.
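For readers unfamiliar with the baseline these mechanisms generalize, the Generalized Second Price rule can be sketched in a few lines (a simplified version without quality scores or reserve prices):

```python
# Minimal GSP sketch: rank bidders by bid and charge each winner the
# next-highest bid. Real deployments also weight by ad quality.
def gsp(bids, num_slots):
    """bids: {bidder: per-click bid}. Returns [(bidder, price)] per slot."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    outcome = []
    for i in range(min(num_slots, len(ranked))):
        bidder, _ = ranked[i]
        # price = the bid just below this bidder's, or 0 if none
        price = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        outcome.append((bidder, price))
    return outcome

slots = gsp({"ann": 1.20, "bob": 0.90, "cal": 0.50}, num_slots=2)
# ann wins slot 1 at bob's bid; bob wins slot 2 at cal's bid.
```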
Jun Yan, Ning Liu, Gang Wang, Wen Zhang, Yun Jiang and Zheng Chen. How much
can Behavioral Targeting Help Online Advertising?
Abstract: Behavioral Targeting (BT) attempts to deliver the most relevant
advertisements to the most interested users, and plays an increasingly
important role in the online advertising market. However, no public work has
investigated how much BT can truly help online advertising in
commercial search engines. To answer this question, in this paper we provide an
empirical study on the ads click-through log collected from a commercial search
engine. From the comprehensively experimental results on the sponsored search
log of a commercial search engine over a period of seven days, we can draw three
important conclusions: (1) users who clicked the same ad do indeed have similar
behaviors on the Web; (2) the Click-Through Rate (CTR) of an ad can be improved
by as much as 670% on average by properly segmenting users for behaviorally
targeted advertising; (3) using short-term user behaviors to represent users is
more effective than using long-term user behaviors for BT. The statistical t-test
verifies that all conclusions drawn in the paper are statistically significant. To the
best of our knowledge, this work is the first empirical study of BT on real-world
ads click-through logs.
Arpita Ghosh, Benjamin Rubinstein, Sergei Vassilvitskii and Martin Zinkevich.
Adaptive Bidding for Display Advertising
Abstract: Motivated by the emergence of auction-based marketplaces for display
ads such as the Right Media Exchange, we study the design of a bidding agent that
implements a display advertising campaign by bidding in such a marketplace. The
bidding agent must acquire a given number of impressions with a given target
spend, when the highest external bid in the marketplace is drawn from an {\em
unknown} distribution $\mathcal{P}$. The quantity and spend constraints arise from the
fact that display ads are usually sold on a CPM basis. We consider both the full
information setting, where the winning price in each auction is announced publicly,
and the partially observable setting where only the winner obtains information
about the distribution; these differ in the penalty incurred by the agent while
attempting to learn the distribution. We provide algorithms for both settings, and
prove performance guarantees using bounds on uniform closeness from statistics,
and techniques from online learning. We experimentally evaluate these algorithms:
both algorithms perform very well with respect to both target quantity and spend;
further, our algorithm for the partially observable case performs nearly as well as
that for the fully observable setting despite the higher penalty incurred during
learning.
Vahab Mirrokni, Eyal Even-Dar, Yishay Mansour, S Muthukrishnan and Uri Nadav.
Bid Optimization for Broad Match Ad Auctions
Abstract: Ad auctions in sponsored search support a "broad match" feature that
allows an advertiser to target a large number of queries by bidding only
on a limited number of queries as broad match.
While giving more expressiveness to advertisers, this feature makes it challenging
for the advertisers to find bidding strategies to optimize their returns:
choosing to bid on a query as broad match because it provides
high profit can also result in bidding on related queries which may yield low or even
negative profits.
We abstract and study the complexity of the {\em bid optimization problem}, which
is to determine bids on a subset of keywords (possibly using broad match) so that
her profit is maximized.
In the query language model, in which the advertiser is allowed to bid on all queries
as broad match, we present a linear programming (LP)-based polynomial-time
algorithm for the bid optimization problem.
In the model in which an advertiser can only bid on keywords, i.e., bid on
a subset of keywords as an exact or broad match, we show that this problem is not
approximable
within any reasonable approximation factor unless P=NP. To deal with this hardness
result,
we present a constant-factor approximation when the optimal profit significantly
exceeds the cost. This algorithm is based on rounding
a natural LP formulation of the problem.
Finally, we study a budgeted variant of the problem, and show that in the query
language model, one
can find two budget constrained ad campaigns in polynomial time that implement
the optimal bidding strategy.
Our results are the first to address bid optimization under the broad match feature,
which is common in ad auctions.
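The tension behind the bid optimization problem can be seen in a toy model: broad match bundles a keyword's own profit with that of its related queries, which may be negative. A brute-force sketch (illustrative only; exhaustive search is exponential, and the paper shows the keyword-bidding variant is hard to approximate):

```python
from itertools import product

# Toy model: for each keyword the advertiser may skip it, bid exact
# (profit from that query alone), or bid broad (profit from the query
# plus its related queries, some of which may lose money).
def best_bidding(keywords):
    """keywords: {kw: (exact_profit, broad_profit)}. Exhaustive search."""
    best = (0.0, {})
    for choice in product(["skip", "exact", "broad"], repeat=len(keywords)):
        total, plan = 0.0, {}
        for kw, mode in zip(keywords, choice):
            exact, broad = keywords[kw]
            if mode == "exact":
                total += exact
            elif mode == "broad":
                total += broad
            plan[kw] = mode
        if total > best[0]:
            best = (total, plan)
    return best

profit, plan = best_bidding({
    "shoes": (3.0, 5.0),   # broad match pulls in profitable related queries
    "boots": (2.0, -1.0),  # broad match drags in money-losing queries
})
```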
Thomas Meinl and Benjamin Blau. Web Service Derivatives
Abstract: Web service development and usage has shifted from simple information
processing services to high-value business services that are crucial to productivity
and success. In order to deal with an increasing risk of unavailability or failure of
mission-critical Web services we argue the need for advanced reservation of
services in the form of derivatives.
The contribution of this paper is twofold: First we provide an abstract model of a
market design that enables the trade of derivatives for mission-critical Web
services. Our model satisfies requirements that result from service characteristics
such as intangibility and the impossibility of inventorying services in order to meet
fluctuating demand. It incorporates principles from models of incomplete markets,
such as the absence of a tradeable underlying, and supports consistent arbitrage-free
derivative pricing.
Furthermore, we provide an architecture for a Web service market that
implements our model, and we describe the strategy space and interaction of market
participants in the trading process of service derivatives. We compare the
underlying pricing processes to existing derivative models in energy exchanges,
discuss possible shortcomings, and apply wavelets to analyze actual data and
extract long- and short-term trends.
Performance, Scalability and Availability
Zakaria Al-Qudah, Hussein Alzoubi, Mark Allman, Michael Rabinovich and Vincenzo
Liberatore. Efficient Application Placement in a Dynamic Hosting Platform
Abstract: Web hosting providers are increasingly looking into dynamic
hosting to reduce costs and improve the performance of their
platforms. Instead of provisioning fixed resources to each
customer, dynamic hosting maintains a variable number of
application instances to satisfy current demand. While existing
research in this area has mostly focused on the algorithms
that decide on the number and location of application
instances, we address the problem of efficient enactment of
these decisions once they are made. We propose a new approach
to application placement and experimentally show
that it dramatically reduces the cost of application placement,
which in turn improves the end-to-end agility of the
hosting platform in reacting to demand changes.
Toyotaro Suzumura, Michiaki Tatsubori, Scott Trent, Akihiko Tozawa and Tamiya
Onodera. Highly Scalable Web Applications with Zero-Copy Data Transfer
Abstract: The performance of server side applications is becoming increasingly
important as more applications exploit the web application model. Extensive work
has been done to improve the performance of individual software components such
as web servers and programming language runtimes. This paper describes a novel
approach to boost web application performance by improving interprocess
communication between the programming language runtime and operating system.
The approach reduces redundant memory copying and the context
switch overhead between user space and kernel space by exploiting zero-copy
data transfer mechanisms such as the sendfile system call. In order to
transparently utilize this optimization feature with existing web applications, we
propose an enhancement of the PHP runtime, FastCGI protocol, and web
server. Our proposed approach achieves a 126% performance improvement with
micro-benchmarks, and a 22% performance improvement for the standard web
benchmark, SPECweb2005.
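The sendfile primitive the authors exploit can be tried directly from Python; a minimal sketch, assuming a Linux system (file-to-file sendfile is Linux-specific):

```python
import os
import tempfile

# os.sendfile asks the kernel to move bytes between two descriptors
# without copying them through user space -- the zero-copy path this
# paper builds on. (Copying file-to-file requires Linux.)
payload = b"hello, zero-copy world\n" * 100

with tempfile.TemporaryFile() as src, tempfile.TemporaryFile() as dst:
    src.write(payload)
    src.flush()
    # Kernel-side copy: no payload bytes pass through this process's buffers.
    sent = os.sendfile(dst.fileno(), src.fileno(), 0, len(payload))
    dst.seek(0)
    copied = dst.read()
```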
Jeffrey Erman, Alexandre Gerber, Oliver Spatscheck, Dan Pei and MohammadTaghi
Hajiaghayi. Network Aware Forward Caching
Abstract: This paper proposes and evaluates a Network Aware Forward
Caching approach for determining the optimal deployment strategy
of forward caches to a network. A key advantage of this approach
is that we can reduce the costs associated with forward caching to
maximize the benefit obtained from their deployment. We show in
our simulation that a 37% increase to net benefits could be achieved
over the standard method of full cache deployment to cache all
POPs traffic. In addition, we show that this maximal point occurs
when only 68% of the total traffic is cached.
Another contribution of this paper is the analysis we use to motivate
and evaluate this problem. We characterize the Internet traffic
of 100K subscribers of a US residential broadband provider. We
use both layer 4 and layer 7 analysis to investigate the traffic volumes
of the flows as well as study the general characteristics of the
applications used. We show that HTTP is a dominant protocol, accounting
for 68% of the total downstream traffic, and that 34% of that
traffic is multimedia. In addition, we show that multimedia content
using HTTP exhibits an 83% annualized growth rate and other HTTP
traffic has a 53% growth rate, versus the 26% overall annual growth
rate of broadband traffic. This shows that HTTP traffic will become
ever more dominant and increase the potential caching opportunities.
Furthermore, we characterize the core backbone traffic of this
broadband provider to measure how efficiently content and traffic are
delivered. We find that CDN traffic is delivered much more efficiently than P2P
content and that there is a large skew in the air miles between POPs
in a typical network. Our findings show that there are many opportunities
in broadband provider networks to optimize how traffic is
delivered and cached.
Zakaria Al-Qudah, Seungjoon Lee, Michael Rabinovich, Oliver Spatscheck and
Kobus van der Merwe. Anycast-Aware Transport for Content Delivery Networks
Abstract: Anycast-based content delivery networks (CDNs) have many
properties that make them ideal for the large scale distribution of content on the
Internet. However, because routing
changes can result in a change of the endpoint that terminates the TCP session,
TCP session disruption remains a
concern for anycast CDNs, especially for large file downloads. In this paper we
demonstrate that this problem does
not require any complex solutions. In particular, we present
the design of a simple, yet efficient, mechanism to handle session disruptions due
to endpoint changes. With our mechanism, a client can continue the download of
the content
from the point at which it was before the endpoint change.
Furthermore, CDN servers purge the TCP connection state
quickly to handle frequent switching with low system overhead.
We demonstrate experimentally the effectiveness of our
proposed mechanism and show that more complex mechanisms are not required,
even with a reasonably high rate of endpoint switching, which is attractive
for load balancing scenarios. Moreover, our results show
that edge servers can purge TCP connection state after a
single timeout-triggered retransmission without any tangible
impact on ongoing connections. Besides improving server
performance, this behavior improves the resiliency of the
CDN to certain denial of service attacks.
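The resume-from-offset idea is easy to sketch. Here the "endpoints" are plain functions serving the same content, standing in for anycast replicas; a real client would resume with an HTTP Range request after the endpoint change:

```python
# Illustrative sketch (not the paper's mechanism): when the serving
# endpoint changes mid-download, the client resumes from its current
# byte offset against the new endpoint.
CONTENT = bytes(range(256)) * 4  # the file being distributed

def serve(offset):
    """Any replica can serve the same content from a byte offset."""
    return CONTENT[offset:]

def download_with_switch(switch_at):
    """Fetch from one endpoint, get cut off, resume from another."""
    first_part = serve(0)[:switch_at]   # connection dies after switch_at bytes
    resumed = serve(len(first_part))    # new endpoint, Range-style resume
    return first_part + resumed

blob = download_with_switch(switch_at=100)
```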
Rich Media
Alberto Messina and Maurizio Montagnuolo. A Generalised Cross-Modal Clustering
Method Applied to Multimedia News Semantic Indexing and Retrieval
Abstract: The evolution of Web services has enabled the distribution of informative
content through dynamic media such as RSS feeds. In addition, the availability of
the same informative content in the form of digital multimedia data has
dramatically increased. Robust solutions for semantic aggregation of heterogeneous
data streams are needed to efficiently access desired information from this variety
of information sources. To this end, we present a novel approach for aggregating
cross-media information sources, and describe a prototype system implementing
this approach. The prototype adopts online newspaper articles and TV newscasts as
information sources, to deliver a service made up of items including both
contributions. Extensive experiments prove the effectiveness of the proposed
approach in a real-world business context.
Reinier H. van Leuken, Lluís Garcia, Ximena Olivares and Roelof van Zwol. Visual
diversification of image search results
Abstract: Due to the reliance on the textual information associated with an image,
image search engines on the Web lack the discriminative power to deliver visually
diverse search results. The textual descriptions are key to retrieve relevant results
for a given user query, but at the same time provide little information about the
rich image content.
In this paper we investigate three methods for visual diversification of image search
results. The methods deploy lightweight clustering techniques in combination with a
dynamic weighting function of the visual features, to best capture the discriminative
aspects of the resulting set of images that is retrieved. A representative image is
selected from each cluster; together these representatives form a diverse result set.
Based on a performance evaluation we find that the outcome of the methods
closely resembles human perception of diversity, which was established in an
extensive clustering experiment carried out by human assessors.
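The cluster-then-pick-representatives strategy can be sketched with a naive k-means on toy feature vectors (the paper's lightweight clustering and dynamic feature weighting are more sophisticated, and these "features" are invented):

```python
# Sketch: cluster image feature vectors and return one representative
# per cluster, so the result set is visually diverse.
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def diverse_representatives(features, k):
    """Naive k-means; returns the index of one medoid per cluster."""
    centroids = [features[i] for i in range(k)]  # seed with first k points
    clusters = [[] for _ in range(k)]
    for _ in range(10):
        clusters = [[] for _ in range(k)]
        for i, f in enumerate(features):
            clusters[min(range(k), key=lambda c: dist(f, centroids[c]))].append(i)
        centroids = [
            tuple(sum(features[i][d] for i in cl) / len(cl)
                  for d in range(len(features[0])))
            if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
    # representative = cluster member closest to its centroid
    return [min(cl, key=lambda i: dist(features[i], centroids[c]))
            for c, cl in enumerate(clusters) if cl]

# Two obvious visual groups: "dark" images near (0,0), "bright" near (10,10).
feats = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
reps = diverse_representatives(feats, k=2)
```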
Dong Liu, Meng Wang and Hong-Jiang Zhang. Tag Ranking
Abstract: Social media sharing web sites like Flickr allow users to annotate
images with free tags, which significantly facilitate
Web image search and organization. However, the tags associated
with an image generally are in a random order without
any importance or relevance information, which limits
the effectiveness of these tags in search and other applications.
In this paper, we propose a tag ranking scheme, aiming
to automatically rank the tags associated with a given
image according to their relevance to the image content.
We first estimate initial relevance scores for the tags based
on probability density estimation, and then perform a random
walk over a tag similarity graph to refine the relevance
scores. Experimental results on a 50,000 Flickr photo collection
show that the proposed tag ranking method is both
effective and efficient. We also apply tag ranking to three
applications: (1) tag-based image search, (2) tag recommendation,
and (3) group recommendation, which demonstrates
that the proposed tag ranking approach substantially boosts the
performance of social-tagging-related applications.
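The second stage, refining initial scores by a random walk over a tag-similarity graph, can be sketched as follows (the initial scores and similarities here are invented; the paper estimates both from image content and data):

```python
# Sketch of random-walk score refinement: each tag's score is repeatedly
# mixed between its initial estimate and the scores of similar tags.
def refine_scores(initial, sim, alpha=0.5, iters=100):
    """initial: {tag: score}; sim: {tag: {neighbor: similarity weight}}."""
    tags = list(initial)
    score = dict(initial)
    for _ in range(iters):
        nxt = {}
        for t in tags:
            row = sim.get(t, {})
            total = sum(row.values()) or 1.0
            walked = sum(score[u] * w / total for u, w in row.items())
            nxt[t] = alpha * walked + (1 - alpha) * initial[t]
        score = nxt
    return score

initial = {"sunset": 0.9, "beach": 0.6, "camera": 0.2}
sim = {"sunset": {"beach": 1.0}, "beach": {"sunset": 1.0}, "camera": {}}

refined = refine_scores(initial, sim)
ranked = sorted(refined, key=refined.get, reverse=True)
```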
Lyndon Kennedy and Mor Naaman. Less Talk, More Rock: Automated Organization
of Community-Contributed Collections of Concert Videos
Abstract: We describe a system for synchronization and organization
of user-contributed content from live music events. We start
with a set of short video clips taken at a single event by
multiple contributors, who were using a varied set of capture
devices. Using audio fingerprints, we synchronize these
clips such that overlapping clips can be displayed simultaneously.
Furthermore, we use the timing and link structure
generated by the synchronization algorithm to improve the
findability and representation of the event content, including
identifying key moments of interest and descriptive text for
important captured segments of the show. We also identify
the preferred audio track when multiple clips overlap. We
thus create a much improved representation of the event that
builds on the automatic content match. Our work demonstrates
important principles in the use of content analysis
techniques for social media content on the Web, and applies
those principles in the domain of live music capture.
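The synchronization step can be illustrated with a toy fingerprint matcher: slide one clip's hash sequence over the other's and vote for the offset with the most agreements (real audio fingerprints are designed to survive noise and re-encoding, unlike these exact hashes):

```python
from collections import Counter

# Toy sketch of fingerprint-based clip alignment: each clip is a
# sequence of per-frame hashes; the best alignment is the time offset
# on which the most hashes agree.
def best_offset(clip_a, clip_b, max_shift=50):
    votes = Counter()
    for shift in range(-max_shift, max_shift + 1):
        votes[shift] = sum(
            1 for i, h in enumerate(clip_a)
            if 0 <= i + shift < len(clip_b) and clip_b[i + shift] == h
        )
    return votes.most_common(1)[0][0]

event = [hash(("frame", i)) for i in range(100)]  # "ground truth" audio
clip_a = event[10:60]   # contributor A recorded frames 10..59
clip_b = event[25:90]   # contributor B recorded frames 25..89

offset = best_offset(clip_a, clip_b)  # A's frame i sits at B's index i + offset
```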
Lei Wu, Linjun Yang, Nenghai YU and Xian-Sheng Hua. Learning to Tag
Abstract: Social tagging provides valuable and crucial information for
large-scale web image retrieval. It is ontology-free and easy to
obtain; however, noisy tags frequently appear, and users typically
will not tag all semantic objects in the image, which is also called
semantic loss. To avoid noise and compensate for the semantic loss,
tag recommendation has been proposed in the literature. However, current
recommendation methods simply rank the related tags based on the single
modality of tag co-occurrence on the whole dataset, which ignores
other modalities, such as visual correlation. This paper proposes a
multi-modality recommendation based on both tag and visual
correlation, and formulates the tag recommendation as a learning
problem. Each modality is used to generate a ranking feature, and
Rankboost algorithm is applied to learn an optimal combination of
these ranking features from different modalities. Experiments on
Flickr data demonstrate the effectiveness of this learning-based
multi-modality recommendation strategy.
Search
Xing Yi, Hema Raghavan and Chris Leggetter. Discover Users’ Specific Geo
Intention in Web Search
Abstract: Discovering users’ specific and implicit geographic intention in web
search can greatly help satisfy users’ information needs. We build a geo intent
analysis system that learns a model from large scale web-search logs for this
discovery with minimal supervision. We build a city language model, which is a
probabilistic representation of the language surrounding the mention of a city in
web queries. We use several features derived from these language models to: (1)
identify users’ implicit geo intent and pinpoint the city corresponding to this intent;
(2) determine whether the geo intent is localized around the users’ current
geographic location; and (3) predict cities for queries that mention an entity
that is located in a specific place. Experimental results demonstrate the
effectiveness of using features derived from the city language model. We find that
(1) the system has over 90% precision and more than 74% accuracy for the task of
detecting users’ implicit city-level geo intent; (2) the system achieves more than
96% accuracy in determining whether implicit geo queries are local geo queries,
neighboring-region geo queries, or none of these; and (3) the city language model
can effectively retrieve cities in location-specific queries with high precision (88%)
and recall (74%); human evaluation results show that the language model predicts
city labels for location-specific queries with high accuracy (84.5%).
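The city language model idea can be sketched with a toy unigram model per city (smoothing and the paper's log-derived features are heavily simplified, and the cities and queries here are invented):

```python
import math
from collections import Counter, defaultdict

# Sketch: from logged queries that explicitly mention a city, learn a
# word distribution around each city, then score a city-free query
# against each city's model.
def train(logged_queries):
    models = defaultdict(Counter)
    for city, words in logged_queries:
        models[city].update(words)
    return models

def guess_city(models, query_words):
    def score(city):
        counts = models[city]
        total = sum(counts.values())
        # unigram model with add-one smoothing over a tiny vocabulary
        return math.prod((counts[w] + 1) / (total + 10) for w in query_words)
    return max(models, key=score)

log = [
    ("new york", ["broadway", "tickets"]),
    ("new york", ["broadway", "shows"]),
    ("seattle", ["space", "needle"]),
]
city = guess_city(train(log), ["broadway", "tickets"])
```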
Xiangfu Meng, Z. M. Ma and Li Yan. Answering Approximate Queries over
Autonomous Web Databases
Abstract: To deal with the problem of empty or too few answers returned from a
Web database in response to a user query, this paper proposes a novel approach to
providing relevant and ranked query results. Based on the user’s original query, we
speculate how much the user cares about each specified attribute and assign a
corresponding weight to it. This original query is then rewritten as an approximate
query by relaxing the query criteria range. The relaxation order of all specified
attributes and the relaxed degree on each specified attribute are varied with the
attribute weights. For the approximate query results, we generate users’ contextual
preferences from database workload and use them to create a priori orders of
tuples in an off-line preprocessing step. Only a few representative orders are saved,
each corresponding to a set of contexts. Then, these orders and associated
contexts are used at query time to expeditiously provide ranked answers. Results of
a preliminary user study demonstrate that our query relaxation and results ranking
methods can capture the user’s preferences effectively. The efficiency and
effectiveness of our approach are also demonstrated by experimental results.
Huanhuan Cao, Daxin Jiang, Jian Pei, Enhong Chen and Hang Li. Towards Context-
Aware Search by Learning A Very Large Variable Length Hidden Markov Model from
Search Logs
Abstract: Capturing the context of a user’s query from the previous queries and
clicks in the same session leads to better understanding of the user’s information
need. A context-aware approach to document re-ranking, query suggestion, and
URL recommendation may improve users’ search experience substantially. In this
paper, we propose a general approach to context-aware search. To capture
contexts of queries, we learn a variable length Hidden Markov Model (vlHMM) from
search sessions extracted from log data. Although the mathematical model is
intuitive, how to learn a large vlHMM with millions of states from hundreds of
millions of search sessions poses a grand challenge. We develop a strategy for
parameter initialization in vlHMM learning which can greatly reduce the number of
parameters to be estimated in practice. We also devise a method for distributed
vlHMM learning under the map-reduce model. We test our approach on a real data
set consisting of 1.8 billion queries, 2.6 billion clicks, and 840 million search
sessions, and evaluate the effectiveness of the vlHMM learned from the real data on
three search applications: document re-ranking, query suggestion, and URL
recommendation. The experimental results clearly show that our context-
aware approach is both effective and efficient.
Eustache Diemert and Gilles Vandelle. Unsupervised Query Categorization using
Automatically-Built Concept Graphs
Abstract: Automatic categorization of user queries is an important component of
general purpose (Web) search engines, particularly for triggering rich, query-specific
content. We propose an unsupervised query categorization system
that dramatically reduces the cost of setting up and maintaining such a categorizer,
while retaining good categorization power. The model is stored as a graph of
concepts where graph edges represent the cross-reference between the concepts.
Concepts and relations are extracted from query logs by an offline Web mining
process, which uses a search engine as a powerful summarizer for building a
concept graph. Empirical evaluation indicates that the system compares favorably
on publicly available data sets (such as KDD Cup 2005) as well as on portions of
the current query stream of Yahoo! Search, where it is already changing the
experience of millions of Web search users.
Jian Hu, Gang Wang, Fred Lochovsky and Zheng Chen. Understanding User’s Query
Intent with Wikipedia
Abstract: Understanding the intent behind a user’s query can help a search engine
automatically route the query to corresponding vertical search engines to
obtain particularly relevant content, thus greatly improving user satisfaction.
There are three major challenges to the query intent classification problem: (1)
intent representation; (2) domain coverage; and (3) semantic interpretation.
Current approaches to predicting the user’s intent mainly utilize machine learning
techniques. However, it is difficult, and often requires much human effort, to meet
all these challenges with statistical machine learning approaches. In this paper,
we propose a general methodology for the problem of query intent classification.
With very little human effort, our method can discover large quantities of intent
concepts by leveraging Wikipedia, one of the best human knowledge bases. The
Wikipedia concepts are used as the intent representation space, thus, each intent
domain is represented as a set of Wikipedia articles and categories. The intent of
any input query is identified through mapping the query into the Wikipedia
representation space. Compared with previous approaches, our proposed method
can achieve much better coverage to classify queries in an intent domain even
though the number of seed intent examples is very small. Moreover, the method is
very general and can be easily applied to various intent domains. We demonstrate
the effectiveness of this method in three different applications, i.e., travel, job, and
person name. In each of the three cases, only a couple of seed intent queries are
provided. We perform quantitative evaluations in comparison with two baseline
methods, and the experimental results show that our method significantly
outperforms other methods in each intent domain.
Sreenivas Gollapudi and Aneesh Sharma. An Axiomatic Approach to Result
Diversification
Abstract: Understanding user intent is key to designing an
effective ranking system in a search engine. In the absence of any
explicit knowledge of user intent, search engines want to diversify
results to improve user satisfaction. In such a setting, the
probability ranking principle-based approach of presenting the most
relevant results on top can be sub-optimal, and hence the search
engine would like to trade-off relevance for diversity in the results.
In an analogy to prior work on ranking and clustering systems, we use
the axiomatic approach to characterize and design diversification
systems. We develop a set of natural axioms that a diversification
system is expected to satisfy, and show that no diversification
function can satisfy all the axioms simultaneously. We illustrate the
use of the axiomatic framework by providing three example
diversification objectives that satisfy different subsets of the
axioms. We also uncover a rich link to the facility dispersion problem
that results in algorithms for a number of diversification
objectives. Finally, we propose an evaluation methodology to
characterize the objectives and the underlying axioms. We conduct a
large scale evaluation of our objectives based on two data sets: a
data set derived from the Wikipedia disambiguation pages and a product
database.
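The facility-dispersion connection suggests simple greedy algorithms; here is a sketch of greedy max-min diversification on invented "jaguar" results (not one of the paper's exact objectives, and the distance function is a toy):

```python
# Classic greedy for max-min dispersion: repeatedly add the candidate
# farthest (least similar) from everything already selected.
def greedy_diversify(items, distance, k):
    selected = [items[0]]  # seed with the top-ranked (most relevant) item
    while len(selected) < k:
        best = max(
            (it for it in items if it not in selected),
            key=lambda it: min(distance(it, s) for s in selected),
        )
        selected.append(best)
    return selected

# Toy: results for the ambiguous query "jaguar", with a coarse topic score.
results = [("jaguar car", 0), ("jaguar cat", 5), ("jaguar xk", 1),
           ("jaguar habitat", 6), ("jaguar os", 9)]
picked = greedy_diversify(results, lambda a, b: abs(a[1] - b[1]), k=3)
```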
Flavio Chierichetti, Ravi Kumar and Prabhakar Raghavan. Compressed web indexes
Abstract: Web search engines use indexes to efficiently retrieve pages containing
specified query terms, as well as pages linking to specified pages. The problem of
compressed indexes that permit such fast retrieval has a long history. We consider
the problem: assuming that the terms in (or links to) a page are generated from a