KDD Cup: Materials from Past Years


## KDD Cup 2020

### Regular Machine Learning Competition Track (ML Track 1)

Competition name: Challenges for Modern E-Commerce Platform, Task 1 & Task 2

Competition links: Task1, Task2

In ML Track 1, “Challenges for Modern E-Commerce Platform”, sponsored by Alibaba, Alibaba DAMO Academy, Duke University, Tsinghua University, and UIUC, participants are invited to learn high-quality cross-modal representations by considering complex information of different types and the strong connection between modalities. The learned representations can then be used to compute similarity scores between them and to select the images/videos that are related to the text. Finally, each submission will be evaluated on the test dataset by measuring the correspondence between the retrieved products and the ground truth. This track has two tasks: Task 1 and Task 2.
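The retrieval step described above (score each candidate against the text, then keep the best matches) can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the score and using made-up three-dimensional embeddings; the function names `cosine_similarity` and `retrieve_top_k` and the image ids are hypothetical, and the track itself does not prescribe a similarity measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(text_embedding, image_embeddings, k=2):
    """Return the ids of the k images whose embeddings are most similar to the text."""
    scored = [
        (cosine_similarity(text_embedding, emb), image_id)
        for image_id, emb in image_embeddings.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:k]]


# Toy data (assumed): one text query and three candidate images.
query = [1.0, 0.0, 1.0]
images = {
    "img_a": [0.9, 0.1, 0.8],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.5, 0.5, 0.5],
}
print(retrieve_top_k(query, images, k=2))
```

In a real submission the embeddings would come from the learned cross-modal model; only the ranking logic is shown here.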

### Regular Machine Learning Competition Track (ML Track 2)

Competition name: Adversarial Attacks and Defense on Academic Graph

Competition link: https://www.biendata.xyz/competition/kddcup_2020/

The second Regular Track, “Adversarial Attacks and Defense on Academic Graph”, sponsored by BienData, requires participants to submit a modified version of the original dataset as a form of attack: the modified graph should look similar to the original graph but yield lower classification accuracy under baseline models. It should be prepared and saved at the backend of the competition system. Then, all teams are required to submit an attacker and a defender. The organizers will match all attackers and defenders from all teams and rank the final leaderboard.

### Automated Machine Learning Competition Track (AutoML Track)

Competition name: AutoML for Graph Representation Learning

Competition link: https://www.4paradigm.com/competition/kddcup2020

In the AutoML Track “Automatic Graph Representation Learning (AutoGraph)”, provided by 4Paradigm, ChaLearn, Stanford University and Google, participants are invited to deploy AutoML solutions for graph representation learning, where node classification is chosen as the task to evaluate the quality of learned representations. Each team is given five public datasets to develop AutoML solutions. Five feedback datasets are provided to allow participants to evaluate their solutions. These solutions will be evaluated with five unseen datasets without human intervention, and the winners will be chosen based on the final rankings of the datasets.

### Reinforcement Learning Competition Track (RL Track)

Competition name: Learning to Dispatch and Reposition on a Mobility-on-Demand Platform

Competition link: https://www.biendata.xyz/competition/kdd_didi/

The RL Track, “Learning to Dispatch and Reposition on a Mobility-on-Demand Platform”, sponsored by Didi Chuxing, the largest ridesharing platform in the world, in collaboration with DiDi AI Labs, requires participants to apply machine learning tools to determine novel solutions for order dispatching (order matching) and vehicle repositioning (fleet management) on a Mobility-on-Demand (MoD) platform. Specifically, the competition looks at how machine learning solutions can be applied to improve the efficiency of an MoD platform.


## KDD Cup 2019

### Regular Machine Learning Competition Track

Task link: https://dianshi.bce.baidu.com/competition/29/rule

Context-aware multi-modal transportation recommendation aims to recommend a travel plan that considers various unimodal transportation modes, such as walking, cycling, driving, and public transit, and how to connect these modes under various contexts. The successful development of multi-modal transportation recommendations can have a number of advantages, including but not limited to reducing transport times, balancing traffic flows, reducing traffic congestion, and ultimately, promoting the development of intelligent transportation systems.

Despite the popularity and frequent usage of transportation recommendation on navigation Apps (e.g., Baidu Maps and Google Maps), existing transportation recommendation solutions only consider routes in one transportation mode. Intuitively, in the context-aware multi-modal transportation recommendation problem, the transport mode preferences vary over different users and spatiotemporal contexts.

### Automated Machine Learning Competition Track

Competition link: https://www.4paradigm.com/competition/kddcup2019

Temporal relational data is very common in industrial machine learning applications, such as online advertising, recommender systems, financial market analysis, medical treatment, fraud detection, etc. With timestamps to indicate the timings of events and multiple related tables to provide different perspectives, such data contains useful information that can be exploited to improve machine learning performance. However, currently, the exploitation of temporal relational data is often carried out by experienced human experts with in-depth domain knowledge in a labor-intensive trial-and-error manner.

In this challenge, participants are invited to develop AutoML solutions to binary classification problems for temporal relational data. The provided datasets are in the form of multiple related tables, with timestamped instances. Five public datasets (without labels in the testing part) are provided to the participants so that they can develop their AutoML solutions. Afterward, solutions will be evaluated with five unseen datasets without human intervention. The results of these five datasets determine the final ranking.

### “Research for Humanity” Reinforcement Learning Competition Track (Humanity RL Track)

Competition link: https://compete.hexagon-ml.com/practice/rl_competition/37/

Malaria is thought to have had the greatest disease burden throughout human history, and it continues to pose a significant but disproportionate global health burden. With 50% of the world’s population at risk of malaria infection, Sub-Saharan Africa is most affected, with 90% of all cases.

Through this KDD Cup Humanity RL Track competition we are looking for participants to apply machine learning tools to determine novel solutions which could impact malaria policy in Sub-Saharan Africa. Specifically, how should combinations of interventions which control the transmission, prevalence and health outcomes of malaria infection be distributed in a simulated human population?


## KDD Cup 2018

Task name: KDD Cup of Fresh Air

Task link: https://www.biendata.xyz/competition/kdd_2018/

For many large cities, air pollution has become a severe problem. This year’s KDD Cup, titled KDD Cup of Fresh Air, solicits machine learning solutions to accurately forecast air quality indices (AQIs) for the next 48 hours. Accurate predictions of AQIs can bring enormous value to governments, enterprises, and the general public, and help them make informed decisions.

KDD Cup of Fresh Air was launched on March 15th and ended on May 31st. Participants were asked to forecast the AQIs of Beijing, China and London, UK. Over 4,000 teams from 49 countries participated in the competition, and made over 20,000 submissions.


## KDD Cup 2017

Task name: Highway Tollgates Traffic Flow Prediction, Travel Time & Traffic Volume Prediction

Task link: https://tianchi.aliyun.com/competition/entrance/231597/information

Highway tollgates are well known bottlenecks in traffic networks. During rush hours, long queues at tollgates can overwhelm traffic management authorities. Effective preemptive countermeasures are desired to solve this challenge. Such countermeasures include expediting the toll collection process and streamlining future traffic flow. The expedition of toll collection could be simply allocating temporary toll collectors to open more lanes. Future traffic flow could be streamlined by adaptively tweaking traffic signals at upstream intersections. Preemptive countermeasures will only work when the traffic management authorities receive reliable predictions for future traffic flow. For example, if heavy traffic in the next hour is predicted, then traffic regulators could immediately deploy additional toll collectors and/or divert traffic at upstream intersections.

Traffic flow patterns vary due to different stochastic factors, such as weather conditions, holidays, time of the day, etc. The prediction of future traffic flow and ETA (Estimated Time of Arrival) is a known challenge. An unprecedentedly large amount of traffic data from mobile apps such as Waze (in the US) or Amap (in China) can help us take up that challenge. If the contestants in this proposed KDD CUP could design reliable approaches for future traffic flow and ETA prediction, then the traffic management authorities might be able to capitalize on big data and algorithms to reduce congestion at tollgates.


## KDD Cup 2016

Task name: Whose papers are accepted the most: towards measuring the impact of research institutions

Task link: https://kddcup2016.azurewebsites.net/

Finding influential nodes in a social network for identifying patterns or maximizing information diffusion has been an actively researched area with many practical applications. In addition to the obvious value to the advertising industry, the research community has long sought mechanisms to effectively disseminate new scientific discoveries and technological breakthroughs so as to advance our collective knowledge and elevate our civilization. For students, parents and funding agencies that are planning their academic pursuits or evaluating grant proposals, having an objective picture of the institutions in question is particularly essential. Partly against this backdrop we have witnessed that releasing a yearly Research Institution or University Ranking has become a tradition for many popular newspapers, magazines and academic institutes. Such rankings not only attract attention from governments, universities, students and parents, but also create debates on the scientific correctness behind the rankings. The most criticized aspect of these rankings is: the data used and the methodology employed for the ranking are mostly unknown to the public.

The 2016 KDD Cup will address this very important problem through publicly available datasets, like the Microsoft Academic Graph (MAG), a freely available dataset that includes information on academic publications and citations. This dataset, being a heterogeneous graph, can be used to study the influential nodes of various types including authors, affiliations and venues; we choose to focus on affiliations in this competition. In effect, given a research field, we are challenging the KDD Cup community to jointly develop data mining techniques to identify the best research institutions based on their publications and how they are cited in research articles.


## KDD Cup 2015

Task name: Predict dropout rate on MOOC platforms

Students' high dropout rate on MOOC platforms has been heavily criticized, and predicting their likelihood of dropout would be useful for maintaining and encouraging students' learning activities. Therefore, in KDD Cup 2015, we will predict dropout on XuetangX, one of the largest MOOC platforms in China.

Participants need to predict whether a user will drop a course within the next 10 days based on his or her prior activities. If a user leaves no records for course C in the log during the next 10 days, we define it as dropout from course C. For more details about the log, please refer to the Data Descriptions.
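The dropout definition above translates directly into a labeling rule. The sketch below is only an illustration under assumptions: the log is taken to be a mapping from a hypothetical `(user_id, course_id)` pair to the dates on which records exist, the cutoff date is invented, and the window is read as the 10 days strictly after the cutoff. The actual log format is defined in the competition's Data Descriptions.

```python
from datetime import date, timedelta


def is_dropout(log_dates, cutoff, window_days=10):
    """True if no log record falls within the window_days days after cutoff."""
    window_end = cutoff + timedelta(days=window_days)
    return not any(cutoff < d <= window_end for d in log_dates)


# Toy log (assumed layout): (user_id, course_id) -> dates with at least one record.
logs = {
    ("u1", "c1"): [date(2014, 6, 1), date(2014, 6, 12)],
    ("u2", "c1"): [date(2014, 6, 3)],
}
cutoff = date(2014, 6, 10)

# One label per user-course pair: True means dropout.
labels = {pair: is_dropout(dates, cutoff) for pair, dates in logs.items()}
print(labels)
```

Labels are computed per user-course pair, since the same user may stay active in one course while dropping another.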


## KDD Cup 2014

Task name: Predict funding requests that deserve an A+

Task link: https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/

DonorsChoose.org is an online charity that makes it easy to help students in need through school donations. At any time, thousands of teachers in K-12 schools propose projects requesting materials to enhance the education of their students. When a project reaches its funding goal, DonorsChoose.org ships the materials to the school.

The 2014 KDD Cup asks participants to help DonorsChoose.org identify projects that are exceptionally exciting to the business, at the time of posting. While all projects on the site fulfill some kind of need, certain projects have a quality above and beyond what is typical. By identifying and recommending such projects early, they will improve funding outcomes, better the user experience, and help more students receive the materials they need to learn.

Successful predictions may require a broad range of analytical skills, from natural language processing on the need statements to data mining and classical supervised learning on the descriptive factors around each project.


## KDD Cup 2013

### Track 1

Task name: Determine whether an author has written a given paper

Task link: https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge

The ability to search literature and collect/aggregate metrics around publications is a central tool for modern research. Both academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom.

Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community, in addition to literature search. It covers more than 50 million publications and over 19 million authors across a variety of domains, with updates added each week. One of the main challenges of providing this service is caused by author-name ambiguity. On one hand, there are many authors who publish under several variations of their own name. On the other hand, different authors might share a similar or even the same name.

### Track 2

Task name: Identify which authors correspond to the same person

Task link: https://www.kaggle.com/c/kdd-cup-2013-author-disambiguation

The ability to search literature and collect/aggregate metrics around publications is a central tool for modern research. Both academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom.

Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community, in addition to literature search. It covers more than 50 million publications and over 19 million authors across a variety of domains, with updates added each week. One of the main challenges of providing this service is caused by author-name ambiguity. This KDD Cup task challenges participants to determine which authors in a given data set are duplicates.


## KDD Cup 2012

### Track 1

Task name: Predict which users (or information sources) one user might follow in Tencent Weibo

Task link: https://www.kaggle.com/c/kddcup2012-track1/overview

Online social networking services have become tremendously popular in recent years, with popular social networking sites like Facebook, Twitter, and Tencent Weibo adding thousands of enthusiastic new users each day to their existing billions of actively engaged users. Since its launch in April 2010, Tencent Weibo, one of the largest micro-blogging websites in China, has become a major platform for building friendship and sharing interests online.

Currently, there are more than 200 million registered users on Tencent Weibo, generating over 40 million messages each day. This scale benefits Tencent Weibo users, but it can also flood them with huge volumes of information and hence put them at risk of information overload. Reducing the risk of information overload is a priority for improving the user experience, and it also presents opportunities for novel data mining solutions. Thus, capturing users’ interests and accordingly serving them with potentially interesting items (e.g. news, games, advertisements, products) is a fundamental and crucial feature for social networking websites like Tencent Weibo.

### Track 2

Task name: Predict the click-through rate of ads given the query and user information

Task link: https://www.kaggle.com/c/kddcup2012-track2/

Search advertising has been one of the major revenue sources of the Internet industry for years. A key technology behind search advertising is to predict the click-through rate (pCTR) of ads, as the economic model behind search advertising requires pCTR values to rank ads and to price clicks. In this task, given the training instances derived from session logs of the Tencent proprietary search engine, soso.com, participants are expected to accurately predict the pCTR of ads in the testing instances.


## KDD Cup 2011

Task name: Predict music ratings and identify favorite songs

Learn the rhythm, predict the musical scores

People have been fascinated by music since the dawn of humanity. A wide variety of music genres and styles has evolved, reflecting diversity in personalities, cultures and age groups. It comes as no surprise that human tastes in music are remarkably diverse, as nicely exhibited by the famous quotation: "We don't like their sound, and guitar music is on the way out" (Decca Recording Co. rejecting the Beatles, 1962).

Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to.

Such an exciting analysis introduces new scientific challenges. The KDD Cup contest releases over 300 million ratings performed by over 1 million anonymized users. The ratings are given to different types of items (songs, albums, artists, genres), all tied together within a known taxonomy.

The competition is divided into two tracks:

  • The first track is aimed at predicting scores that users gave to various items.
  • The second track requires separation of loved songs from other songs.

Both tracks are open to all research groups in academia and industry.


## KDD Cup 2010

Task name: Student performance evaluation

How generally or narrowly do students learn? How quickly or slowly? Will the rate of improvement vary between students? What does it mean for one problem to be similar to another? It might depend on whether the knowledge required for one problem is the same as the knowledge required for another. But is it possible to infer the knowledge requirements of problems directly from student performance data, without human analysis of the tasks?

This year’s challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. This task presents interesting technical challenges, has practical importance, and is scientifically interesting.


## KDD Cup 2009

Task name: Customer relationship prediction

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

The most practical way to build knowledge about customers in a CRM system is to produce scores. A score (the output of a model) is an evaluation, for all instances, of a target variable to explain (i.e. churn, appetency or up-selling). Tools which produce scores make it possible to project quantifiable information onto a given population. The score is computed using input variables which describe instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. An industrial customer analysis platform able to build prediction models with a very large number of input variables has been developed by Orange Labs. This platform implements several processing methods for instance and variable selection, prediction and indexation based on an efficient model combined with variable selection regularization and a model averaging method. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that have contributed most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables), and unbalanced class distributions. Time efficiency is often a crucial point. Therefore part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.


## KDD Cup 2008

Task name: Breast cancer

The KDD Cup 2008 challenge focuses on the problem of early detection of breast cancer from X-ray images of the breast. In a screening population, a small fraction of cancerous patients have more than one malignant lesion. To simplify the problem, we only consider one type of cancer - cancerous masses - and only include cancer patients with at most one cancerous mass per patient. The challenge will consist of two parts, each of which is related to the development of algorithms for Computer Aided Detection (CAD) of early stage breast cancer from X-ray images.


## KDD Cup 2007

Task name: Consumer recommendations

This year's KDD Cup focuses on predicting aspects of movie rating behavior. There are two tasks. The tasks, developed in conjunction with Netflix, have been selected to be interesting to participants from both academia and industry. You can choose to compete in either or both of the tasks.


## KDD Cup 2006

Task name: Pulmonary embolisms detection from image data

This year's KDD Cup challenge problem is drawn from the domain of medical data mining. The tasks are a series of Computer-Aided Detection problems revolving around the clinical problem of identifying pulmonary embolisms from three-dimensional computed tomography data. This challenging domain is characterized by:

  • Multiple instance learning
  • Non-IID examples
  • Nonlinear cost functions
  • Skewed class distributions
  • Noisy class labels
  • Sparse data

## KDD Cup 2005

Task name: Internet user search query categorization

This year's competition is about classifying internet user search queries. The task was specifically designed to draw participation from industry, academia, and students.


## KDD Cup 2004

Task name: Particle physics; plus protein homology prediction

This year's competition focuses on data-mining for a variety of performance criteria such as Accuracy, Squared Error, Cross Entropy, and ROC Area. As described on this WWW-site, there are two main tasks based on two datasets from the areas of bioinformatics and quantum physics.

We will use the program perf to measure the performance of the predictions you submit on the eight performance metrics. You do not need to use perf, but using perf will ensure that the metrics you optimize are defined the same way we will measure them.

  • For the Particle Physics Problem:

    • ACC: accuracy
    • ROC: area under the ROC curve (aka AUC)
    • CXE: cross-entropy
    • SLQ 0.01: Stanford Linear Accelerator Q-score (more on this later)
  • For the Protein Matching Problem:

    • TOP1: how often is a correct match (a homolog) ranked first
    • RMS: root-mean-squared-error (similar to optimizing squared error)
    • RKL: rank of the last matching case (rank of the last positive case)
    • APR: average precision
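To make a few of the metrics listed above concrete, the sketch below computes accuracy, area under the ROC curve, cross-entropy and root-mean-squared error for binary labels and predicted probabilities. It is an illustration only: the function names, the 0.5 threshold, and the averaging conventions are assumptions, and perf's exact definitions may differ. SLQ and the ranking metrics (TOP1, RKL, APR) are not covered.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of cases where the thresholded prediction matches the label."""
    correct = sum((p >= threshold) == bool(t) for t, p in zip(y_true, y_prob))
    return correct / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def rmse(y_true, y_prob):
    """Root of the mean squared difference between labels and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true))


# Toy predictions (assumed): two positives, two negatives.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7]
print(accuracy(y_true, y_prob), roc_auc(y_true, y_prob))
print(cross_entropy(y_true, y_prob), rmse(y_true, y_prob))
```

The pairwise AUC computation is quadratic in the number of cases, which is fine for a toy example but would be replaced by a sort-based version on the real datasets.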

## KDD Cup 2003

Task name: Network mining and usage log analysis

This year's competition focuses on problems motivated by network mining and the analysis of usage logs. Complex networks have emerged as a central theme in data mining applications, appearing in domains that range from communication networks and the Web, to biological interaction networks, to social networks and homeland security. At the same time, the difficulty in obtaining complete and accurate representations of large networks has been an obstacle to research in this area.

This KDD Cup is based on a very large archive of research papers that provides an unusually comprehensive snapshot of a particular social network in action; in addition to the full text of research papers, it includes both explicit citation structure and (partial) data on the downloading of papers by users. It provides a framework for testing general network and usage mining techniques, which will be explored via four varied and interesting tasks. Each task is a separate competition with its own specific goals.

The first task involves predicting the future; contestants predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference. For the second task, contestants must build a citation graph of a large subset of the archive from only the LaTeX sources. In the third task, each paper's popularity will be estimated based on partial download logs. And the last task is open! Given the large amount of data, contestants can devise their own questions and the most interesting result is the winner.

### About the Data

The e-print arXiv, initiated in Aug 1991, has become the primary mode of research communication in multiple fields of physics, and some related disciplines. It currently contains over 225,000 full text articles and is growing at a rate of 40,000 new submissions per year. It provides nearly comprehensive coverage of large areas of physics, and serves as an on-line seminar system for those areas. It serves 10 million requests per month, including tens of thousands of search queries per day. Its collections are a unique resource for algorithmic experiments and model building. Usage data has been collected since 1991, including Web usage logs beginning in 1993. On average, the full text of each paper was downloaded over 300 times since 1996, and some were downloaded tens of thousands of times.

The Stanford Linear Accelerator Center SPIRES-HEP database has been comprehensively cataloguing the High Energy Particle Physics (HEP) literature online since 1974, and indexes more than 500,000 high-energy physics related articles including their full citation tree.


## KDD Cup 2002

Task name: BioMed document; plus gene role classification

This year the competition included two tasks that involved data mining in molecular biology domains. The first task focused on constructing models that can assist genome annotators by automatically extracting information from scientific articles. The second task focused on learning models that characterize the behavior of individual genes in a hidden experimental setting. Both are described in more detail on the Tasks page.


## KDD Cup 2001

Task name: Molecular bioactivity; plus protein locale prediction

Because of the rapid growth of interest in mining biological databases, KDD Cup 2001 was focused on data from genomics and drug design. Sufficient (yet concise) information was provided so that detailed domain knowledge was not a requirement for entry. A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.


## KDD Cup 2000

Task name: Online retailer website clickstream analysis

The KDD Cup 2000 domain contains clickstream and purchase data from Gazelle.com, a legwear and legcare web retailer that closed their online store on 8/18/2000.


## KDD Cup 1999

Task name: Computer network intrusion detection

The task for the classifier learning contest organized in conjunction with the KDD'99 conference was to learn a predictive model (i.e. a classifier) capable of distinguishing between legitimate and illegitimate connections in a computer network.


## KDD Cup 1998

Task name: Direct marketing for lift curve optimization

The competition task is a regression problem where the goal is to estimate the return from a direct mailing in order to maximize donation profits.

The dataset and task are the same as in KDD Cup 1997.


## KDD Cup 1997

Task name: Direct marketing for lift curve optimization

This year, for the first time, the KDD 1997 Organization is organizing a Knowledge Discovery and Data Mining competition (KDD CUP 1997) in conjunction with the 3rd International Conference on Knowledge Discovery and Data Mining (KDD 1997).

The Cup is open to all KDDM tool vendors, academics and corporations with significant applications. All products, applications, research prototypes and black-box solutions are welcome. If requested, the anonymity of the participants and their affiliated companies / institutions will be preserved. Our aim is not to rank the participants but to recognize the most innovative, efficient and methodologically advanced KDDM tools.

This year's challenge is to predict who is most likely to donate to a charity. Contestants were evaluated on their accuracy on the validation data set.


