Qi He (何奇)

Seek Creative Essence in Truth

 

Awards

2007 Student Travel Grant, to present paper at SIGIR 2007 in Amsterdam, Netherlands

2007 Student Travel Grant, to present paper at SIAM on Data Mining 2007 in Minneapolis, Minnesota, U.S.A.

2006 Microsoft Research Fellowship, presented in person by Dr. Hsiao-Wuen Hon (managing director of MSRA); one of 2 awardees from Singapore (36 from Asia-Pacific)

Graduate Scholarship, to pursue the Ph.D. program at Nanyang Technological University, January 2005 – January 2008

2002 Lenovo Fellowship, for outstanding academic performance in the M. Eng program

2001 IBM Best Student Award, honored among the best students in China

2001 Suntek Fellowship, for outstanding academic performance in the M. Eng program

2000 Jinghua Fellowship, for outstanding academic performance in the B. Eng program

Research Experiences

Microsoft Research Asia, Beijing

Internship research, September 2007 – February 2008

Analyze Query Sessions for Web Search Interactions

Finished work:

· Proposed a novel method that uses a Variable Memory Markov (VMM) model for Web search query prediction. Prediction accuracy increases significantly as query history accumulates within the same query session. Full-length Markov models have several theoretical and practical drawbacks when simulating user behaviors such as query sessions, which motivated the VMM. The VMM efficiently captures most of the temporal characteristics of queries while retaining a much lower compression ratio than full-length Markov models. The method is especially well suited to real-life scenarios in which the user has already issued 2 or more queries. The encouraging results pave the way for a range of end-to-end applications in Web search interaction, such as query suggestion, query expansion, and query substitution. This work was submitted to SIGIR 2008.
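The idea of predicting the next query from a variable-length history can be sketched as follows. This is only an illustration, not the submitted method: the function names (`train_vmm`, `predict_next`), the parameters `max_order` and `min_count`, and the toy sessions are all assumptions, and pruning rare contexts by a simple count threshold stands in for whatever criterion the actual VMM uses.

```python
from collections import Counter, defaultdict

def train_vmm(sessions, max_order=3, min_count=2):
    """Count next-query frequencies for contexts of length 0..max_order."""
    counts = defaultdict(Counter)
    for session in sessions:
        for i, query in enumerate(session):
            for k in range(max_order + 1):
                if i - k < 0:
                    break
                context = tuple(session[i - k:i])  # the k queries before position i
                counts[context][query] += 1
    # Variable memory (assumed rule): keep a context only if it was seen
    # at least min_count times; the empty context is always kept.
    return {ctx: nxt for ctx, nxt in counts.items()
            if len(ctx) == 0 or sum(nxt.values()) >= min_count}

def predict_next(model, history, max_order=3):
    """Back off from the longest stored suffix of the history to shorter ones."""
    for k in range(min(max_order, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None

# Hypothetical toy sessions, for illustration only.
sessions = (
    [["car", "jaguar", "jaguar dealer"]] * 2
    + [["zoo", "jaguar", "jaguar habitat"]] * 3
)
model = train_vmm(sessions)
print(predict_next(model, ["jaguar"]))          # one query of history
print(predict_next(model, ["car", "jaguar"]))   # two queries of history
```

In this toy data the single query "jaguar" is ambiguous, while the two-query history disambiguates it, which mirrors the observation that accuracy improves once the user has issued 2 or more queries.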

Future work:

· Analyze user behaviors from query sessions in more detail. During the above experiments, we found that query sessions contain a great deal of unexpected noise, so billions of query sessions have to be aggregated to obtain statistically valid results. User behaviors may also be irregular; for example, a historical query may be repeated frequently. Poor data quality is a major obstacle to further study, so the effective preprocessing and aggregation of search engine query logs remains a major challenge.

· Study how to re-rank search results via query session analysis. Results from query session analysis can be applied to facilitate Web search interactions, and user feedback can be used to help re-rank the search results. A ranking benchmark based on user behaviors is more reliable. For example, a query session could help narrow the search scope according to a particular user’s past search patterns.

Nanyang Technological University, Singapore

PhD research, January 2005 – May 2008

Bursty Event Detection from Temporal Text Streams

· Detected periodic and aperiodic events from news streams. Automatically extracting historical events from a news stream is an open research problem, because the performance of traditional document clustering methods depends heavily on the granularity of similarity among news articles. Instead, we proposed to generate events based on bursty words. We found that bursty words have varying bursty periods, and that it is more accurate and efficient to identify events from bursty words occurring during the same bursty period. As a result, the bursty period of each word led to the automatic discovery of bursty periodic and aperiodic events. This work was published in SIGIR 2007, among others.

· Analyzed the effect of bursty words on bursty event clustering. Document clustering has been a classical method for identifying topics/events from a news stream. However, previous research has neither considered detecting only important (bursty) events, nor incorporated the bursty properties of words into the clustering. In this research, we embedded news documents into a Euclidean space consisting of only bursty words. The encouraging results showed that clustering in this new space enables important events to be quickly uncovered from a large-scale news corpus. This work was published in the SIAM Conference on Data Mining 2007, among others.

· Proposed a new research problem called Anticipatory Event Detection (AED). The vast majority of existing research on event detection targets only historical events, which restricts its application to real-world problems. In practice, people expect to be notified of impending events of personal interest as early as possible. This matters especially in the financial sector, where stock brokers, for example, are extremely sensitive to the timely notification of political events happening anywhere in the world. In AED, a user subscribes to topics of interest by specifying a few keywords that describe his/her anticipated event. Whenever the event is triggered by our detector, a notification is sent to the user. We trained the anticipatory event detector as a classifier using Support Vector Machines. Experiments verified the high precision of such a classifier for various types of user-defined events, such as company Merger and Acquisition events. This work was published in ER 2006, among others.
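The notion of grouping words by a shared bursty period, described in the bullets above, can be illustrated with a small sketch. The text here does not say how burstiness is measured, so the rule below (frequency above mean plus `k` standard deviations) is an assumed heuristic rather than the published algorithm, and the function names and word frequencies are hypothetical.

```python
from statistics import mean, pstdev

def bursty_periods(freqs, k=1.0):
    """Return (start, end) index pairs where frequency exceeds mean + k * std."""
    threshold = mean(freqs) + k * pstdev(freqs)
    periods, start = [], None
    for t, f in enumerate(freqs):
        if f > threshold and start is None:
            start = t                         # a burst begins
        elif f <= threshold and start is not None:
            periods.append((start, t - 1))    # the burst ended at t - 1
            start = None
    if start is not None:                     # burst runs to the end of the stream
        periods.append((start, len(freqs) - 1))
    return periods

def group_by_period(word_freqs, k=1.0):
    """Group words whose bursty periods coincide exactly (candidate events)."""
    groups = {}
    for word, freqs in word_freqs.items():
        for period in bursty_periods(freqs, k):
            groups.setdefault(period, []).append(word)
    return groups

# Hypothetical per-day word frequencies, for illustration only.
word_freqs = {
    "election": [1, 1, 1, 9, 9, 1, 1, 1],
    "ballot":   [0, 0, 0, 7, 8, 0, 0, 0],
    "weather":  [3, 3, 3, 3, 3, 3, 3, 3],
}
print(group_by_period(word_freqs))
```

Requiring identical periods is deliberately strict; a fuller treatment would probably allow overlapping periods and would need a separate test for periodic bursts.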

Recent Hot Research Words

The following keywords are automatically detected by my own bursty word detection algorithm.

Word          Interval of burst            Word           Interval of burst
xml           2001 SIGMOD - 2006 SIGMOD    engine         2004 SIGMOD - 2006 VLDB
web           1999 SIGMOD - 2006 SIGMOD    sensor         2002 SIGMOD - 2006 VLDB
streams       2002 SIGMOD - 2006 SIGMOD    querying       2001 VLDB - 2006 VLDB
stream        2003 SIGMOD - 2006 VLDB      ranking        2004 VLDB - 2006 VLDB
over          2002 SIGMOD - 2006 VLDB      streaming      2002 SIGMOD - 2006 VLDB
services      2001 VLDB - 2006 VLDB        monitoring     2001 SIGMOD - 2006 VLDB
search        2002 SIGMOD - 2006 SIGMOD    xpath          2002 SIGMOD - 2006 VLDB
efficient     2005 VLDB - 2006 VLDB        keyword        2002 SIGMOD - 2006 VLDB
continuous    2003 SIGMOD - 2006 VLDB      statistics     2003 VLDB - 2006 SIGMOD
xquery        2002 SIGMOD - 2006 SIGMOD    automatic      2003 VLDB - 2006 SIGMOD
matching      2002 SIGMOD - 2006 VLDB      adaptive       2000 SIGMOD - 2006 VLDB
approximate   1997 VLDB - 2006 VLDB        detection      2005 VLDB - 2006 VLDB
networks      2004 SIGMOD - 2006 VLDB      peer-to-peer   2001 VLDB - 2006 VLDB
indexing      1999 VLDB - 2006 VLDB        top-k          2002 SIGMOD - 2006 VLDB

Note: 1) "xml 2001 SIGMOD - 2006 SIGMOD" means the xml-related research topics have been studied from 2001 SIGMOD to 2006 SIGMOD. 2) The time sequence is "1975 SIGMOD, 1975 VLDB... 2006 SIGMOD, 2006 VLDB". Therefore, "2004 SIGMOD - 2005 SIGMOD" covers three proceedings: 2004 SIGMOD, 2004 VLDB, and 2005 SIGMOD.
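The counting rule in the note can be checked with a few lines of code. This is a minimal sketch that assumes only what the note states: the sequence starts at 1975 and SIGMOD precedes VLDB within each year. The function names are hypothetical.

```python
VENUES = ("SIGMOD", "VLDB")  # order within a year, as stated in the note
FIRST_YEAR = 1975            # first year of the time sequence

def position(label):
    """Map a label such as '2004 SIGMOD' to its index in the time sequence."""
    year, venue = label.split()
    return (int(year) - FIRST_YEAR) * len(VENUES) + VENUES.index(venue)

def proceedings_covered(interval):
    """Count the proceedings spanned by an interval, both endpoints included."""
    start, end = (part.strip() for part in interval.split("-"))
    return position(end) - position(start) + 1

print(proceedings_covered("2004 SIGMOD - 2005 SIGMOD"))  # 3, matching the note
```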

Top Conferences/Journals in Data Mining Related Fields

Conferences

Name      Coming deadline
SIGIR     Jan, 2009
SIGKDD    Feb, 2009
ICML      Feb, 2009
VLDB      Mar, 2009
ICDE      June, 2008
NIPS      June, 2008
SDM       Oct, 2008
WWW       Oct, 2008

Journals

Name

TOIS

TKDE

Professional Activities

Paper Reviewer

  • PAKDD (Pacific-Asia Conference on Knowledge Discovery and Data Mining), 2008

  • Infoscale (ICST Conference on Scalable Information Systems), 2007

  • CIKM (ACM Conference on Information and Knowledge Management), 2006

Mentor, NTU Research Peer Mentorship Program, May 2007 – May 2008

One of 2 Ph.D. students nominated by the School of Computer Engineering, NTU, to participate in the Research Peer Mentorship Program. During the program, I was given the opportunity to recruit 2 undergraduates (Chong Wei Wah James and Vu Minh Tan) from 146 applicants to work with me on my research. Supervising and mentoring the undergraduates helped develop my supervisory and leadership skills.

Research Topics

This section introduces the research topics I have been interested in over the past years. Each topic comes with a simple yet clear brief (with examples where I can), followed by its recent challenging or open problems (as of my latest update). Please do not hesitate to contact me if you have comments or find any problems. Collaboration in research is always welcome.

Temporal Vector Space/Probability Model for Topical/Categorical Data