ALRA: Active Learning in Real-world Applications
Workshop ECML-PKDD 2012,
Friday, September 28, 2012, Bristol, UK
This workshop aims to offer a meeting opportunity for academic and industrial researchers from the Computational Intelligence, Machine Learning, Experimental Design and Data Mining communities to discuss new areas of active learning, and to bridge the gap between data acquisition or experimentation and model building. How can active sampling, incremental learning and data acquisition contribute to the design and modeling of highly intelligent machine learning systems?
Machine learning refers to methods and algorithms that allow a model to learn a behavior from examples. Active learning covers the methods that select the examples used to build a training dataset for the predictive model. All of these strategies aim to use as small a set of examples as possible by selecting the most informative ones.
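As an illustration of what "selecting the most informative examples" can mean, here is a minimal sketch of uncertainty sampling, one common selection strategy. It is not tied to any paper of the workshop. The function names and the probability values are hypothetical, and the sketch assumes a binary classifier that outputs a probability for the positive class.

```python
import math

def entropy(p):
    """Binary entropy (in bits) of a predicted positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def select_most_uncertain(probabilities, batch_size):
    """Return the indices of the batch_size examples with the highest entropy."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: entropy(probabilities[i]),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical model outputs P(label = 1) for five unlabeled examples.
pool_probabilities = [0.95, 0.52, 0.10, 0.45, 0.80]
queries = select_most_uncertain(pool_probabilities, batch_size=2)
print(queries)  # prints [1, 3]: the two predictions closest to 0.5
```

The examples sent to the expert are those the current model is least sure about. Other criteria, such as density or expected error reduction, can replace the entropy score without changing the overall loop.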
Designing active learning algorithms for real-world data raises some specific issues, chiefly scalability and practicability. Methods must be able to handle high volumes of data in possibly high-dimensional spaces, and the process by which an expert labels new examples must be optimized.
We encouraged papers that describe applications of active learning in real-world settings. Each paper had to describe the industrial context, the main difficulties encountered and the original solution developed. The following challenge was jointly organized on a practical application of active learning. The challenge and the results obtained will be presented.
As a search engine for places, Nomao collects data from multiple sources on the web and aggregates them. The deduplication process consists of detecting which data refer to the same place. Machine Learning is well suited to automating this process, and Active Learning is appropriate for optimizing the creation of the training dataset.
However, in such a real-world case, labeling data is costly, while large amounts of unlabeled data are available. This raises specific issues: the main ones are the scalability of the proposed method, the representativeness of the training dataset (e.g. learning when test and train inputs can have different distributions), and the practicability of the labeling process (e.g. purchase of data labels in batches).
29,104 examples have already been labeled, each example being characterized by 120 features. This training dataset is available on the Nomao Challenge page, along with a test set of size 1,985 and a set of 100,000 unlabeled examples.
Starting on Friday, June 1, 2012, two active campaigns were organized, with each participant allowed to ask an expert to label a given number (100) of the unlabeled examples. A test campaign was then carried out to evaluate the proposed approaches: each participant was asked to label the examples in the test set, and their predictions were compared to the known true labels.
Papers that address this issue are welcome. Authors will thus contribute to the comparison of the proposed solutions and to the discussions during the workshop. The authors of the best results received a free registration for the conference and workshop.
Topics of interest include
- Active Learning
- Experimental Design
- Incremental Learning
- On-line learning
- Case Studies of Active Learning
- First active campaign: Friday, June 1, 2012
- Second active campaign: Friday, June 8, 2012
- Final test campaign: Friday, June 15, 2012
- Paper submission deadline: Friday, June 29, 2012
- Paper acceptance notification: Tuesday, July 31, 2012
- Paper camera-ready deadline: Tuesday, August 14, 2012
- Workshop: Friday, September 28, 2012, Bristol, UK
|9:00 - 10:00|Keynote Talk|Declarative modeling for machine learning and data mining|Luc De Raedt|
|10:00 - 10:30||||
|10:30 - 11:30|Invited Talk|Active Learning for Discovery in the Laboratory: Characterising Biomolecular Systems|Chris Lovell|
|11:30 - 12:15|Challenge presentation|Design and Analysis of the Nomao challenge: Active Learning in the Real-World|Laurent Candillier and Vincent Lemaire|
|12:15 - 12:45|Challenge winner|Batch-Mode Active Learning by Using Misclassified Data|Tengyu Sun and Jie Zhou|
|12:45 - 13:15|Position paper|Programmer's Active Learning: A Broader Perspective of Choices for Real-World Classification Tasks that Matter|George Forman|
|13:15 - 14:45||||
|14:45 - 15:15|Paper #5|Incorporating Density in Active Learning with Application to Ranking|Wenbin Cai and Ya Zhang|
|15:15 - 15:45|Paper #6|Active learning in the spatial-domain for landslide mapping in remote sensing images|Andre Stumpf, Nicolas Lachiche, Jean-Philippe Malet, Norman Kerle and Anne Puissant|
|15:45 - 16:00|Participants open discussion|||
|16:00 - 16:30||||
Active Learning for Discovery in the Laboratory: Characterising Biomolecular Systems
Chris Lovell
Abstract: Resource costs are a limiting factor in many real-world problems. For example, a chemist or biologist may only be able to afford a limited number of chemicals that can be used to perform experiments that examine a hypothesis. A domain expert may only have a limited amount of time they can spend analysing a dataset. An inaccessible remote sensor or robotic exploratory system may only have a limited communications bandwidth through which data can be transmitted. In each case there is a strong requirement to make the best use of the resources available, to obtain as much information as possible. As such, active learning appears well suited to address the balance required between resource usage and information gain. However, real-world discovery problems often have factors or constraints upon them that may not currently be well addressed by active learning.
As an example of applying active learning to real-world problems, we consider laboratory-based experimental response characterisation of biomolecular systems. In the process of scientific discovery, such characterisation is often the initial investigation performed to build a loose understanding of the behaviours that exist within a particular experiment parameter space. The aim is normally to navigate a large and potentially high-dimensional parameter space as efficiently as possible, to determine if it contains any interesting behaviours that may warrant further in-depth investigation, or whether a new search should start elsewhere. As a consequence, the resources available to such a problem are often very small, particularly in relation to the size and scale of the parameter spaces explored. Additionally, biological experimentation is error-prone, meaning that there is no guarantee an observation obtained in an experiment is representative of the true underlying behaviour. With deviations from the true value worse than standard experimental noise, such erroneous observations pose a significant problem for active learning: combined with extremely limited resources, and in turn few previous observations, they introduce a large amount of uncertainty into the problem. If an observation does not fit with the current belief of the behaviour of the system being investigated, the question has to be asked as to whether the observation is erroneous, or whether the current belief is incorrect. Whilst repeat experiments can be performed, performing too many of them will reduce already limited resources, restricting exploration and potentially causing features of the behaviour to be missed. In machine learning terms, the problem is that of learning from an extremely small amount of data that may contain errors, where the learner controls which data to obtain.
To address these problems, we took insight from how successful scientists go about facing these issues whilst making discoveries in the laboratory. Ideas within the philosophy of science in particular provide different ways of viewing and managing the problems, when compared to the more mathematically focussed views of active learning literature. This led to the development of initial generalised methods for addressing these problems, to produce a set of algorithms that combine ideas from philosophy of science and active learning, which are capable of effectively characterising biomolecular systems within the laboratory.
Design and Analysis of the Nomao challenge: Active Learning in the Real-World
Laurent Candillier and Vincent Lemaire
Abstract: Active Learning is an active area of research in the Machine Learning and Data Mining communities. In parallel, the need for efficient active learning methods arises in real-world applications. As an illustration, we present in this paper an active learning challenge applied to a real-world application named Nomao. Nomao is a search engine for places. It aggregates information from multiple sources on the web to offer complete information about a place. In this context, active learning is used to efficiently detect data that refer to the same place. This process is called data deduplication. Since it is a real-world application, some additional constraints have to be handled. The main ones are scalability of the proposed method, representativeness of the training dataset, and practicability of the labeling process.
Batch-Mode Active Learning by Using Misclassified Data
Tengyu Sun and Jie Zhou
Abstract: In this paper, we propose a batch-mode active learning strategy that makes use of misclassified data. The proposed algorithm first trains a classifier using the labeled data. Then, in the active learning step, unlike many existing algorithms that query unlabeled samples close to the decision boundary, ours queries samples close to the data misclassified by the current classifier. To incorporate diversity into the batch-mode querying, the proposed algorithm clusters the unlabeled samples near the misclassified data and queries the samples closest to each cluster center. Experimental results on real-world datasets show that the proposed algorithm performs satisfactorily.
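The steps named in this abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the nearest-centroid classifier, the fixed `radius` used to define "close to the misclassified data", the k-means clustering with farthest-first seeding, and all data values are assumptions made for the example.

```python
import math

def fit_centroids(X, y):
    """Toy stand-in classifier (assumption): one mean vector per class."""
    model = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        model[label] = tuple(sum(c) / len(members) for c in zip(*members))
    return model

def predict(model, x):
    """Predict the class whose centroid is nearest to x."""
    return min(model, key=lambda label: math.dist(model[label], x))

def kmeans(points, k, iterations=10):
    """Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(centers[j], p))
            groups[nearest].append(p)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def select_batch(X_lab, y_lab, X_pool, batch_size, radius):
    """Pick pool indices near misclassified labeled points, one per cluster."""
    model = fit_centroids(X_lab, y_lab)
    misclassified = [x for x, t in zip(X_lab, y_lab) if predict(model, x) != t]
    # Candidates: unlabeled samples within `radius` of a misclassified point.
    candidates = [i for i, x in enumerate(X_pool)
                  if any(math.dist(x, m) <= radius for m in misclassified)]
    if len(candidates) <= batch_size:
        return candidates
    centers = kmeans([X_pool[i] for i in candidates], batch_size)
    batch = []
    for center in centers:
        # Query the not-yet-chosen candidate closest to this cluster center.
        best = min((i for i in candidates if i not in batch),
                   key=lambda i: math.dist(X_pool[i], center))
        batch.append(best)
    return batch

# Hypothetical data: (4, 4) and (1, 1) are misclassified by the toy model.
X_lab = [(0, 0), (1, 0), (0, 1), (4, 4), (5, 5), (6, 5), (5, 6), (1, 1)]
y_lab = [0, 0, 0, 0, 1, 1, 1, 1]
X_pool = [(0.9, 1.2), (1.5, 0.8), (1.1, 0.9), (4.1, 3.8),
          (3.6, 4.5), (4.0, 4.3), (8.0, 8.0), (2.5, 2.5)]
print(select_batch(X_lab, y_lab, X_pool, batch_size=2, radius=1.0))  # prints [2, 5]
```

Clustering the candidates before querying is what gives the batch its diversity: without it, all queries could fall next to the same misclassified point.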
Programmer's Active Learning: A Broader Perspective of Choices for Real-World Classification Tasks that Matter
George Forman
Abstract: This position paper opens the discussion about a future kind of active learning where, rather than just asking a domain expert to assign class labels to items, the system directs a proficient data mining programmer to perform a much wider variety of tasks, e.g. writing code to produce more predictive features for distinguishing confused classes, composing regular expressions to extract key-value features from technical text, writing a classification rule for some tight cluster of cases found by the system, or deciding whether the current classifier is satisficing, in view of its limited rate of improvement. Since data mining programmers are already involved in most efforts to develop classifiers for important real-world tasks, the benefits of channeling their talents to optimize their productivity are intriguing, as well as the potential for reducing the time-to-market for deploying an accurate classifier.
Incorporating Density in Active Learning with Application to Ranking
Wenbin Cai and Ya Zhang
Abstract: Active learning aims to achieve high performance using as few labeled training examples as possible, thereby minimizing the cost of data labeling. In Web search ranking applications, learning to rank is an important task that automatically builds a ranking function through supervised learning. Like many supervised learning tasks, it requires a large amount of labeled training data to train a high-quality ranking function. Meanwhile, in many real-world learning-to-rank applications, data labeling is usually very expensive and time-consuming. To reduce the labeling cost, there have been many studies on applying active learning to ranking, which aim to select the most informative examples for manual labeling. However, existing work ignores information about prior data density, which can be useful for active learning. In this paper, we use the classical Kernel Density Estimation (KDE) method to infer information about data density. Then, under the Generalization Error Reduction (GER) framework, we propose a novel active learning strategy to select the most informative example that minimizes the generalization error. The proposed strategy is applied at the query level, the document level, and further at query-document level with a two-stage active learning algorithm. Experimental results on a real-world Web search ranking dataset have demonstrated the effectiveness of the proposed active learning algorithms.
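To give an idea of why density information can matter, the sketch below weights an uncertainty score by a Gaussian kernel density estimate over the unlabeled pool. It is a simple density-weighting heuristic swapped in for illustration, not the GER-based strategy of the paper. The one-dimensional features, the bandwidth, and the uncertainty values are all assumptions.

```python
import math

def gaussian_kde(points, bandwidth):
    """Return a function estimating the density at x from 1-D sample points."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                          for p in points)
    return density

def density_weighted_choice(pool, uncertainty, bandwidth=0.5):
    """Return the index maximizing uncertainty times estimated density."""
    density = gaussian_kde(pool, bandwidth)
    scores = [u * density(x) for x, u in zip(pool, uncertainty)]
    return max(range(len(pool)), key=lambda i: scores[i])

# Hypothetical pool: four points in a dense region and one isolated outlier.
pool = [0.0, 0.1, 0.2, 0.3, 5.0]
uncertainty = [0.6, 0.7, 0.8, 0.6, 0.9]

plain = max(range(len(pool)), key=lambda i: uncertainty[i])
weighted = density_weighted_choice(pool, uncertainty)
print(plain, weighted)  # prints 4 2
```

Plain uncertainty picks the outlier at 5.0, whose label says little about the rest of the data. Weighting by density moves the query into the dense region.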
Active learning in the spatial-domain for landslide mapping in remote sensing images
Andre Stumpf, Nicolas Lachiche, Jean-Philippe Malet, Norman Kerle and Anne Puissant
Abstract: Active learning has recently been adopted to reduce the labeling costs in the supervised classification of remote sensing images. However, existing active learning approaches do not consider geographic space and yield a spatially dispersed sample distribution that still requires many individual surveys and is accompanied by relatively high annotation costs. We suggest a region-based active learning strategy that enables marker-based labeling of thousands of objects in a relatively short time and speeds up the learning process. An application to the rapid mapping of landslides from very-high-resolution satellite images is presented.
- Mahmoud Abou-Nasr (Ford Motor Company, USA)
- Cesare Alippi (Politecnico di Milano, Italy)
- Albert Bifet (University of Waikato, Hamilton, New Zealand)
- Zalan Bodo (Babes Bolyai University, Cluj-Napoca, Romania)
- Lehel Csato (Babes Bolyai University, Cluj-Napoca, Romania)
- Gideon Dror (Academic college of Tel-Aviv Yaffo, Israel)
- Hugo Jair Escalante (National Institute of Astrophysics, Optics and Electronics, Mexico)
- Matthieu Geist (IMS Research Group, Supelec, Metz, France)
- Liang Lan (Temple University, Philadelphia, USA)
- Chris Lovell (University of Southampton, UK)
- George Runger (Arizona State University, Tempe, AZ, USA)
- Burr Settles (Carnegie Mellon University, USA)
- Fabien Torre (INRIA, Lille 1 University, France)
- Ming-Hen Tsai (National Taiwan University)
- Ioannis Tsamardinos (University of Crete, Greece)
- Slobodan Vucetic (Temple University, Philadelphia, USA)
Submitted papers must be written in English and formatted according to the Springer LNAI guidelines. Papers must not exceed 16 pages. Instructions for authors and paper stylesheet files can be downloaded at: http://www.springer.de/comp/lncs/authors.html
Papers must be submitted via EasyChair: http://www.easychair.org/conferences/?conf=alraecml2012
Papers are normally reviewed by three referees. The review process is single-blind (reviewer identities unknown to authors) and there is no opportunity for author rebuttal. This decision was made to minimize reviewer workload and to concentrate it in time, which may ultimately result in better review quality and decisions. If necessary, a discussion takes place among the reviewers of a paper until a decision is reached.