Rich Sutton

Nicolo Cesa-Bianchi, Gabor Lugosi

Volodya Vovk, Chris Watkins

Mark Herbster, Manfred Warmuth

Kenji Yamanishi

John Case, Sanjay Jain, Matthias Ott, Arun Sharma, Frank Stephan

Jochen Nessel

Andrew Mitchell

Lea Meyer

Meir Feder

Robert E. Schapire, Yoram Singer

Avrim Blum, Tom Mitchell

Banquet speaker (9:00 - 9:30): David Spiegelhalter

Stuart Russell

Claudio Gentile, David P. Helmbold

Philip M. Long

Philip M. Long

David Gamarnik

Andreas Birkendorf, Norbert Klasner, Christian Kuhlmann, Hans U. Simon

Roni Khardon

Irene Tsapara, Gyorgy Turan

Thomas R. Amoth, Paul Cull, Prasad Tadepalli

Sanjay Jain, Carl Smith, Rolf Wiehagen

R\"udiger Reischuk, Thomas Zeugmann

Ron Kohavi

Philip Dawid

Yoav Freund, Robert E. Schapire

Martin Anthony, Sean B. Holden

David A. McAllester

M. Miltrup, G. Schnitger

Yoav Freund

Christopher D. Rosin

Michael Kearns, Dana Ron

Baruch Awerbuch, Stephen Kobourov

Leslie G. Valiant

Amos Beimel, Felix Geller, Eyal Kushilevitz

Organizers: Thomas Dietterich, David Heckerman, Michael Kearns (Mills Concert Hall, Humanities Bldg)

Title: On Sequential Prediction of Individual Sequences Relative to a Set of Experts Authors: Nicolo Cesa-Bianchi and Gabor Lugosi Email: cesabian@dsi.unimi.it, lugosi@upf.es Home_page: http://www.dsi.unimi.it/~cesabian, http://bonvent.upf.es/~lugosi Affiliation: University of Milan, Italy, Pompeu Fabra University, Spain Abstract: We investigate sequential randomized prediction of an arbitrary sequence taking values from a binary alphabet. The goal of the predictor is to minimize his Hamming loss relative to the loss of the best "expert" in a fixed, possibly infinite, set of experts. We point out a surprisingly close connection between the prediction problem and empirical process theory. Using ideas and results from empirical process theory, we show upper and lower bounds on the minimax relative loss in terms of the geometry of the class of experts. As a main example, we determine the exact order of magnitude of the minimax relative loss for the class of Markov experts. Furthermore, in the special case of static experts, we completely characterize the minimax relative loss in terms of the maximal deviation of an associated Rademacher process.
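The finite-expert case of the setting above can be illustrated with a minimal sketch (not the authors' construction, which handles infinite classes and empirical-process geometry): the exponentially weighted forecaster, whose expected Hamming loss is within ln(n)/eta + eta*T/8 of the best of n experts over T rounds. The function name and the choice eta=0.4 are illustrative.

```python
import math

def hedge_expected_hamming_loss(outcomes, expert_preds, eta=0.4):
    """Exponentially weighted forecaster (Hedge) for binary prediction with
    expert advice. Returns (expected Hamming loss of the randomized
    forecaster, cumulative Hamming loss of the best expert)."""
    n = len(expert_preds)
    losses = [0.0] * n          # cumulative Hamming loss of each expert
    alg_loss = 0.0
    for t, y in enumerate(outcomes):
        weights = [math.exp(-eta * L) for L in losses]
        total = sum(weights)
        probs = [w / total for w in weights]
        # expected loss this round = probability mass on wrong experts
        alg_loss += sum(p for p, e in zip(probs, expert_preds) if e[t] != y)
        for i, e in enumerate(expert_preds):
            losses[i] += (e[t] != y)
    return alg_loss, min(losses)
```

On a sequence containing a perfect expert, the forecaster's expected loss stays below the standard ln(n)/eta + eta*T/8 regret bound.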

title: Universal Portfolio Selection authors: V.Vovk and C.Watkins e-mail: {vovk,chrisw}@dcs.rhbnc.ac.uk affiliation: Royal Holloway, University of London address: Department of Computer Science Egham, Surrey TW20 0EX, UK abstract: A typical problem in portfolio selection in stock markets is that it is not clear which of the many available strategies should be used. We apply a general algorithm of prediction with expert advice (the Aggregating Algorithm) to two different idealizations of the stock market. One is the well-known game introduced by Cover in connection with his ``universal portfolio'' algorithm; the other is a more realistic modification of Cover's game introduced in this paper, where the market's participants are allowed to take ``short positions'', so that the algorithm may be applied to currency and futures markets. Besides applying the Aggregating Algorithm to a countable (or finite) family of arbitrary investment strategies, we also apply it, in the case of Cover's game, to the uncountable family of ``constant rebalanced portfolios'' considered by Cover. We generalize Cover's worst-case bounds for his ``universal portfolio'' algorithm (which can be regarded as a special case of the Aggregating Algorithm corresponding to learning rate~1) to the case of learning rates not exceeding~1. Finally, we discuss a general approach to designing investment strategies in which, instead of making statistical or other assumptions about the market, natural assumptions of computability are made about possible investment strategies; this approach leads to natural extensions of the notion of Kolmogorov complexity.
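The ``constant rebalanced portfolio'' family in Cover's game can be sketched for two assets: the universal portfolio's final wealth is the average CRP wealth under a uniform prior, approximated here by a finite grid. This is an illustrative sketch, not the Aggregating Algorithm of the paper; function names are invented.

```python
def crp_wealth(b, price_relatives):
    """Final wealth of a constant rebalanced portfolio that puts fraction b
    of wealth in asset 1 and (1 - b) in asset 2 at every period."""
    w = 1.0
    for x1, x2 in price_relatives:
        w *= b * x1 + (1.0 - b) * x2
    return w

def universal_wealth(price_relatives, grid=200):
    """Cover's universal portfolio (two-asset case): final wealth equals the
    average CRP wealth under a uniform prior on b, here a uniform grid."""
    bs = [i / grid for i in range(grid + 1)]
    return sum(crp_wealth(b, price_relatives) for b in bs) / len(bs)
```

On Cover's classic example of an oscillating asset against cash, the universal portfolio grows even though buy-and-hold of either asset does not, and its wealth stays within Cover's W*/(n+1)-style factor of the best CRP in hindsight.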

Title: Tracking the Best Regressor Authors: M. Herbster and M. Warmuth Email: mark@cs.ucsc.edu, manfred@cs.ucsc.edu Home_page: http://www.cse.ucsc.edu/~mark, http://www.cse.ucsc.edu/~manfred Affiliation: University of California at Santa Cruz Address: Computer Science Department, Applied Sciences, Santa Cruz, California 95064 (USA) Thanks_to: This research was supported by grant CCR-9700201 from the National Science Foundation. Abstract: In most of the on-line learning research the total on-line loss of the algorithm is compared to the total loss of the best off-line predictor $\bu$ from a comparison class of predictors. We call such bounds static bounds. The interesting feature of these bounds is that they hold for an arbitrary sequence of examples. Recently some work has been done where the comparison vector $\bu_t$ at each trial $t$ is allowed to change with time, and the total on-line loss of the algorithm is compared to the sum of the losses of $\bu_t$ at each trial plus the total ``cost'' for shifting to successive comparison vectors. This is to model situations in which the examples change over time and different predictors from the comparison class are best for different segments of the sequence of examples. We call such bounds shifting bounds. Shifting bounds still hold for arbitrary sequences of examples and also for arbitrary partitions. The algorithm does not know the off-line partition and the sequence of predictors that its performance is compared against. Naturally, shifting bounds are much harder to prove. The only known bounds are for the case when the comparison class consists of a finite set of experts or boolean disjunctions. In this paper we develop the methodology for lifting known static bounds to the shifting case. In particular we obtain bounds when the comparison class consists of linear neurons (linear combinations of experts). Our essential technique consists of the following.
At the end of each trial we {\em project} the hypothesis of the static algorithm into a suitably chosen convex region. This keeps the hypothesis of the algorithm well-behaved and the static bounds can be converted to shifting bounds so that the cost for shifting remains reasonable.
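The projection step can be illustrated with a deliberately simple stand-in (not the authors' algorithm): online gradient descent for squared loss on linear predictors, with the hypothesis projected back into an L2 ball after every trial so that it stays well-behaved. All names and constants here are illustrative.

```python
import math

def project_to_l2_ball(w, radius):
    """Euclidean projection of w onto the convex region {v : ||v|| <= radius}."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return w
    return [x * radius / norm for x in w]

def online_gd_with_projection(examples, radius=1.0, lr=0.1):
    """Online gradient descent for squared loss; after each trial the
    hypothesis is projected back into the L2 ball of the given radius."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    for x, y in examples:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        g = 2.0 * (pred - y)                  # derivative of (w.x - y)^2 in w.x
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        w = project_to_l2_ball(w, radius)     # keep the hypothesis in the region
    return w
```

The projection is a no-op while the hypothesis stays inside the ball and rescales it onto the boundary otherwise.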

Title: Minimax Relative Loss Analysis for Sequential Prediction Algorithms Using Parametric Hypotheses Author: Kenji Yamanishi Email: yamanisi@ccm.cl.nec.co.jp Affiliation: Theory NEC Laboratory, Real World Computing Partnership, C&C Media Research Laboratories, NEC Corporation Address: 1-1, 4-chome, Miyazaki, Miyamae-ku, Kawasaki, Kanagawa 216, Japan. Abstract: We consider the problem of sequential prediction using a parametric hypothesis class of real-valued functions, e.g., linear predictors, piecewise polynomials, etc. We are specifically concerned with evaluating the minimax relative cumulative loss (RCL), which is defined as inf_{A} sup_{D} R(A:D). Here R(A:D) is the difference between the cumulative loss for an algorithm A w.r.t. a data sequence D and the minimum cumulative loss over the class w.r.t. D. The first contribution of this paper is to derive a non-asymptotic upper bound on the minimax RCL in a general decision-theoretic setting. It is given by (k/2B)\ln m + C, where k is the number of parameters, m is the sample size, and B and C do not depend on m. We explicitly identify B and C by relating them to the hypothesis class and the loss function in a general form. To derive this bound, we propose an algorithm AG2, whose key idea is to discretize the continuous parameter space with optimal size and then to apply Vovk's aggregating technique. The second contribution of this paper is to derive tight lower bounds on the minimax RCL. We thereby show that AG2 is optimal in the sense that its upper bound on RCL matches the lower bound within o(\log m). The third contribution of this paper is to propose a distributed variant of AG2, denoted by DAG, which makes the learning process parallel. We prove that under certain constraints on the loss function and the hypothesis class, DAG achieves the same predictive performance as AG2 while attaining a significant speed-up of learning over it.

Title: Robust Learning Aided by Context Authors: 1) John Case 2) Sanjay Jain 3) Matthias Ott 4) Arun Sharma 5) Frank Stephan Email: 1) case@cis.udel.edu 2) sanjay@iscs.nus.edu.sg 3) m_ott@ira.uka.de 4) arun@cse.unsw.edu.au 5) fstephan@math.uni-heidelberg.de Home_page: 1) http://www.cis.udel.edu/~case 2) http://www.iscs.nus.sg/~sanjay/ 3) http://i11www.ira.uka.de/~m_ott/ 4) http://www.cse.unsw.edu.au/~arun/ 5) http://math.uni-heidelberg.de:80/logic/fstephan/fstephan.html Affiliation: 1) University of Delaware 2) National University of Singapore 3) Universität Karlsruhe 4) University of New South Wales 5) Universität Heidelberg Address: 1) Department of CIS Newark, DE 19716, USA 2) Department of Information Systems and Computer Science Singapore 119260, Republic of Singapore 3) Institut für Logik, Komplexität und Deduktionssysteme 76128 Karlsruhe, Germany 4) School of Computer Science and Engineering Sydney 2052, Australia 5) Mathematisches Institut Im Neuenheimer Feld 294 69120 Heidelberg, Germany Thanks_to: This work was carried out while J. Case, S. Jain, M. Ott, and F. Stephan were visiting the School of Computer Science and Engineering at the University of New South Wales 3) Supported by the Deutsche Forschungsgemeinschaft (DFG) Graduiertenkolleg ``Beherrschbarkeit komplexer Systeme'' (GRK 209/2-96) 4) Supported by Australian Research Council Grant A49600456 5) Supported by the Deutsche Forschungsgemeinschaft (DFG) Grant Am 60/9-2 Abstract: Empirical studies of multitask learning provide some evidence that the performance of a learning system on its intended targets improves by presenting to the learning system related tasks, also called contexts, as additional input. Angluin, Gasarch, and Smith, as well as Kinber, Smith, Velauthapillai, and Wiehagen have provided mathematical justification for this phenomenon in the inductive inference framework. 
However, their proofs rely heavily on self-referential coding tricks, that is, they directly code the solution of the learning problem into the context. Fulk has shown that for the EX- and BC-anomaly hierarchies, such results, which rely on self-referential coding tricks, may not hold robustly. In this work we analyze robust versions of learning aided by context and show that --- in contrast to Fulk's result above --- the robust versions of these learning notions are still very powerful. Also studied is the difficulty of the functional dependence between the intended target tasks and useful associated contexts.

Title: Birds can fly... Authors: Jochen Nessel Email: nessel@informatik.uni-kl.de Affiliation: University of Kaiserslautern Address: FB Informatik Postfach 3049 67653 Kaiserslautern Abstract: ...with a few exceptions. But in real life almost nobody pays much attention to these exceptions. What we really are interested in is a practicable rule. In Inductive Inference some measures of complexity are known in which the functions of finite support are difficult to learn. But this class consists only of functions which are ``the constant zero function with a few exceptions.'' The following presents a uniform approach to modify three of these measures in a way that the functions of finite support become easy to learn, and tries to give intuitive support for the modified models. The modification is in essence a formal distinction between rules and exceptions, based on the assumption that the complexity of a learning problem should be determined by the difficulty of learning the rules, and not by the difficulty of finding the exceptions.

Title: Learnability of a subclass of extended pattern languages Authors: Andrew R. Mitchell Email: andrewm@cse.unsw.edu.au Home_page: http://www.cse.unsw.edu.au/~andrewm Affiliation: University of New South Wales Address: Department of Artificial Intelligence, School of Computer Science and Engineering, University of New South Wales, Sydney 2052, Australia Thanks_to: This research was supported by an Australian Postgraduate Award and the Australian Research Council Grant A49600456 to A. Sharma Abstract: Angluin introduced the class of pattern languages as term languages over strings on a finite alphabet, and showed that the class is identifiable in the limit from texts (positive data). In Angluin's definition of pattern languages, erasing substitutions are disallowed. Extended pattern languages are term languages over strings when erasing substitutions are allowed. The question of identifiability in the limit of extended pattern languages from texts has not been resolved, and is one of the outstanding open problems in inductive inference. However, there are at least two subclasses that are known to be identifiable. Shinohara has shown that the class of extended pattern languages that can be generated by regular patterns (these are patterns in which each variable occurs no more than once) is identifiable in the limit from texts. Wright showed that the class of extended pattern languages where the pattern contains at most m distinct variables, for any positive integer m, is identifiable in the limit from texts. The present paper considers a generalisation of regular patterns where each variable occurring in the pattern occurs exactly $m$ times for some positive integer $m$. Such patterns are referred to as quasi-regular patterns, and for each m, the corresponding class of extended pattern languages is denoted QRP_m. It is shown that for each m, the class QRP_m is identifiable in the limit from texts.
The proof employs a sophisticated combinatorial argument for bounding the search space. It is also shown that the union of all the classes QRP_m is identifiable in the limit from texts.

Title: Aspects of Complexity of Conservative Probabilistic Learning Authors: Lea Meyer Email: lea@modell.iig.uni-freiburg.de Home_page: http://www.iig.uni-freiburg.de/modell/users/lea/index.html Affiliation: Albert-Ludwigs-Universität Freiburg Address: Institut für Informatik und Gesellschaft Friedrichstr. 50 79098 Freiburg Abstract: When dealing with probabilistic learning of indexed families under monotonicity constraints, we obtain highly structured probabilistic hierarchies. In particular, conservative probabilistic learning is more powerful than conservative deterministic learning even if the probability is close to 1, provided the learning machines are required to work with respect to proper or class-preserving hypothesis spaces. The additional learning power of probabilistic learning machines can be compensated when we allow the machines to ask ``questions'' to an oracle. In this paper, we show that conservative probabilistic learning is less powerful than conservative learning with a K-oracle provided the probability p is greater than 1/2. However, K is necessary, i.e., for each p>1/2 there exists a learning problem which is conservatively learnable by a K-oracle machine but not by an oracle machine having access to an oracle A which has a lower Turing degree than K. The main result is that for each oracle A weaker than the halting problem K, there exists an indexed family L(A) which is conservatively identifiable with p=1/2, and which exactly reflects the Turing degree of A, i.e., L(A) is conservatively identifiable by an oracle machine M[B] iff the Turing degree of A is lower than or equal to the Turing degree of B. However, not every indexed family which is conservatively identifiable with probability p=1/2 reflects the Turing degree of an oracle. Hence, conservative probabilistic learning with p=1/2 is too ``rich'' to be measured in terms of Turing complexity.

Title: Learning to Communicate via Unknown Channel Authors: Meir Feder Email: meir@eng.tau.ac.il Affiliation: Electrical Eng. dept., Tel-Aviv University

Title: Improved Boosting Algorithms Using Confidence-rated Predictions Authors: Robert E. Schapire and Yoram Singer Email: schapire@research.att.com, singer@research.att.com Home_page: http://www.research.att.com/~schapire http://www.research.att.com/~singer Affiliation: AT&T Labs Address: 180 Park Avenue Florham Park, NJ 07932 Abstract: We describe several improvements to Freund and Schapire's AdaBoost boosting algorithm, particularly in a setting in which hypotheses may assign confidences to each of their predictions. We give a simplified analysis of AdaBoost in this setting, and we show how this analysis can be used to find improved parameter settings as well as a refined criterion for training weak hypotheses. We give a specific method for assigning confidences to the predictions of decision trees, a method closely related to one used by Quinlan. This method also suggests a technique for growing decision trees which turns out to be identical to one proposed by Kearns and Mansour. We focus next on how to apply the new boosting algorithms to multiclass classification problems, particularly to the multi-label case in which each example may belong to more than one class. We give two boosting methods for this problem. One of these leads to a new method for handling the single-label case which is simpler but as effective as techniques suggested by Freund and Schapire. Finally, we give some experimental results comparing a few of the algorithms discussed in this paper.
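The confidence-rated framework can be sketched in a toy form, assuming threshold stumps whose per-block predictions are 0.5*ln(W+/W-) in the domain-partitioning style the abstract describes; the weight update D_{t+1}(i) = D_t(i) exp(-y_i h_t(x_i))/Z_t folds the confidence into the weak hypothesis. Function names and the smoothing constant are illustrative, not the paper's.

```python
import math

def stump(theta, c0, c1):
    """Real-valued stump: predict c1 above the threshold, c0 below."""
    return lambda x: c1 if x > theta else c0

def best_confidence_stump(xs, ys, D, thetas):
    """On each side of a candidate threshold predict 0.5*ln(W+/W-)
    (smoothed), and pick the threshold minimizing the normalizer Z."""
    eps = 1e-6
    best_h, best_z = None, float("inf")
    for t in thetas:
        cs = []
        for side in (False, True):
            wp = sum(d for x, y, d in zip(xs, ys, D) if (x > t) == side and y == 1)
            wm = sum(d for x, y, d in zip(xs, ys, D) if (x > t) == side and y == -1)
            cs.append(0.5 * math.log((wp + eps) / (wm + eps)))
        h = stump(t, cs[0], cs[1])
        z = sum(d * math.exp(-y * h(x)) for x, y, d in zip(xs, ys, D))
        if z < best_z:
            best_h, best_z = h, z
    return best_h, best_z

def adaboost_confidence(xs, ys, thetas, rounds=5):
    """Boosting with confidence-rated weak hypotheses:
    D_{t+1}(i) = D_t(i) * exp(-y_i * h_t(x_i)) / Z_t."""
    n = len(xs)
    D = [1.0 / n] * n
    hyps = []
    for _ in range(rounds):
        h, z = best_confidence_stump(xs, ys, D, thetas)
        hyps.append(h)
        D = [d * math.exp(-y * h(x)) / z for x, y, d in zip(xs, ys, D)]
    return lambda x: 1 if sum(h(x) for h in hyps) > 0 else -1
```

On threshold-separable data the combined classifier fits the training sample exactly.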

Title: Combining Labeled and Unlabeled Data with Co-Training Authors: Avrim Blum and Tom Mitchell Email: avrim+@cs.cmu.edu, mitchell+@cs.cmu.edu Home_page: http://www.cs.cmu.edu/~avrim http://www.cs.cmu.edu/~tom Affiliation: Carnegie Mellon University Address: School of Computer Science, Carnegie Mellon University, Pittsburgh PA 15213-3891 Thanks_to: This research was supported in part by the DARPA HPKB program under contract F30602-97-1-0215 and by NSF National Young Investigator grant CCR-9357793 Abstract: We consider the problem of using a large unlabeled sample to boost performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views. For example, the description of a web page can be partitioned into the words occurring on that page, and the words occurring in hyperlinks that point to that page. We assume that either view of the example would be sufficient for learning if we had enough labeled data, but our goal is to use both views together to allow inexpensive unlabeled data to augment a much smaller set of labeled examples. Specifically, the presence of two distinct views of each example suggests strategies in which two learning algorithms are trained separately on each view, and then each algorithm's predictions on new unlabeled examples are used to enlarge the training set of the other. Our goal in this paper is to provide a PAC-style analysis for this setting, and, more broadly, a PAC-style framework for the general problem of learning from both labeled and unlabeled data. We also provide empirical results on real web-page data indicating that this use of unlabeled examples can lead to significant improvement of hypotheses in practice.
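A toy version of the co-training loop, assuming each view is a single real-valued feature and a midpoint-threshold weak learner; both simplifications, and all names below, are illustrative and not from the paper.

```python
def threshold_classifier(pairs):
    """Midpoint-of-class-means threshold from (feature, +/-1 label) pairs."""
    pos = [f for f, y in pairs if y == 1]
    neg = [f for f, y in pairs if y == -1]
    theta = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return (lambda f, t=theta: 1 if f > t else -1), theta

def co_train(labeled, unlabeled, rounds=3, per_round=2):
    """labeled: ((view1, view2), label) pairs; unlabeled: (view1, view2) pairs.
    Each view's classifier labels its most confident unlabeled examples,
    and the pseudo-labeled examples enlarge the other view's training set."""
    pool = list(unlabeled)
    train1 = [(v1, y) for (v1, v2), y in labeled]
    train2 = [(v2, y) for (v1, v2), y in labeled]
    for _ in range(rounds):
        h1, t1 = threshold_classifier(train1)
        h2, t2 = threshold_classifier(train2)
        # confidence = distance from the current threshold
        conf1 = sorted(pool, key=lambda p: -abs(p[0] - t1))[:per_round]
        conf2 = sorted(pool, key=lambda p: -abs(p[1] - t2))[:per_round]
        for v1, v2 in conf1:
            train2.append((v2, h1(v1)))   # view 1 teaches view 2
        for v1, v2 in conf2:
            train1.append((v1, h2(v2)))   # view 2 teaches view 1
        pool = [p for p in pool if p not in conf1 and p not in conf2]
    return threshold_classifier(train1)[0], threshold_classifier(train2)[0]
```

With two well-separated clusters, two labeled examples plus the unlabeled pool suffice for both per-view classifiers to generalize.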

Title: Learning agents for uncertain environments Authors: Stuart Russell Email: russell@cs.berkeley.edu Affiliation: Computer Science Dept., University of California, Berkeley.

Title: Improved Lower Bounds for Learning from Noisy Examples: an Information-Theoretic Approach Authors: Claudio Gentile, David P. Helmbold Email: gentile@dsi.unimi.it, dph@cse.ucsc.edu Affiliation: Universita' di Milano, University of California Santa Cruz Address: DSI, Via Comelico 39, 20135 Milano, Italy, Computer Science Department, 95064 Santa Cruz, USA Thanks_to: The second author is supported by NSF grant CCR 9700201 Abstract: This paper presents a general information-theoretic approach for obtaining lower bounds on the number of examples needed to PAC learn in the presence of noise. This approach deals directly with the fundamental information quantities, avoiding a Bayesian analysis. The technique is applied to several different models, illustrating its generality and power. The resulting bounds add logarithmic factors to (or improve the constants in) previously known lower bounds.

Title: The complexity of learning according to two models of a drifting environment Authors: P. M. Long Email: plong@iscs.nus.edu.sg Home_page: http://www.iscs.nus.edu.sg/~plong Affiliation: National University of Singapore Address: ISCS Department National University of Singapore Singapore 119260, Republic of Singapore Abstract: We establish new sufficient conditions for learning in a drifting environment for both the agnostic and the realizable cases. Both match known necessary conditions to within constant factors. We provide a simpler proof of a bound on the sample complexity of agnostic learning in a fixed environment that is within a constant factor of the best possible.

Title: On the sample complexity of learning functions with bounded variation Authors: P. M. Long Email: plong@iscs.nus.edu.sg Home_page: http://www.iscs.nus.edu.sg/~plong Affiliation: National University of Singapore Address: ISCS Department National University of Singapore Singapore 119260, Republic of Singapore Thanks_to: This research was supported by National University of Singapore Academic Research Fund Grant RP960625. Abstract: We show that the class BV of [0,1]-valued functions with total variation at most 1 can be agnostically learned with respect to the absolute loss in polynomial time from O((1/epsilon)^2 log (1/delta)) examples, matching a known lower bound to within a constant factor. We establish a bound of O(1/m) on the expected error of a polynomial-time algorithm for learning BV in the prediction model, also matching a known lower bound to within a constant factor. Applying a known algorithm transformation to our prediction algorithm, we obtain a polynomial-time PAC learning algorithm for BV with a sample complexity bound of O((1/epsilon)log(1/delta)); this also matches a known lower bound to within a constant factor.

Title: Efficient Learning of Monotone Concepts via Quadratic Optimization Authors: David Gamarnik Email: gamarnik@watson.ibm.com Affiliation: IBM, T. J. Watson Research Center Address: IBM, T. J. Watson Research Center P.O. Box 218, Yorktown Heights, NY 10598

Title: Structural Results about Exact Learning with Unspecified Attribute Values Authors: Andreas Birkendorf, Norbert Klasner, Christian Kuhlmann, and Hans U. Simon Email: birkendo@lmi.ruhr-uni-bochum.de, klasner@lmi.ruhr-uni-bochum.de, kuhlmann@lmi.ruhr-uni-bochum.de, simon@lmi.ruhr-uni-bochum.de Home_page: http://www.lmi.ruhr-uni-bochum.de/birkendorf/, http://www.lmi.ruhr-uni-bochum.de/klasner/, http://www.lmi.ruhr-uni-bochum.de/kuhlmann/, http://www.lmi.ruhr-uni-bochum.de/simon/ Affiliation: Ruhr-Universitaet Bochum Address: Lehrstuhl Mathematik und Informatik, Fakultaet fuer Mathematik, D-44780 Bochum, Germany Thanks_to: The authors gratefully acknowledge the support of Deutsche Forschungsgemeinschaft grant Si 498/3-2 and the support of the German-Israeli Foundation for Scientific Research and Development grant I-403-001.06/95. Abstract: This paper deals with the UAV learning model of Goldman, Kwek and Scott [GoldmanKS97], where ``UAV'' is the acronym for ``Unspecified Attribute Values''. As in [GoldmanKS97], we consider exact learning within the UAV framework. A smooth transition between exact learning in the UAV setting and standard exact learning is obtained by putting a fixed bound r on the number of unspecified attribute values per instance. For r=0, we obtain the standard model. For r=n (the total number of attributes), we obtain the (unrestricted) UAV model. Between these extremes, we find the hierarchies (UAV-MQ_r)_{0 <= r <= n}, (UAV-EQ_r)_{0 <= r <= n}, and (UAV-ARB-EQ_r)_{0 <= r <= n}. Our main results are as follows. We present various lower bounds on the number of ARB-EQs and UAV-MQs in terms of the Vapnik-Chervonenkis dimension of the concept class. We show furthermore that a natural extension of Angluin's Sunflower Lemma [Angluin88] is still applicable in the exact UAV learning model. Our UAV Sunflower Lemma allows us to establish exponentially large lower bounds on the necessary number of UAV-MQs for several popular concept classes.
On the other hand, we can show that slight simplifications of these classes are efficiently learnable using only a few UAV-MQs. Finally, we investigate the inherent structure of the aforementioned three hierarchies and the relations between them. It turns out that query type UAV-EQ_{r-1} is strictly stronger than UAV-EQ_r (for each constant r). The analogous result holds for UAV-ARB-EQ. Furthermore, UAV-MQ_{r+omega(log n)} is strictly stronger than UAV-MQ_r. We also determine the relation between query types chosen from different hierarchies.

Title: Learning First Order Universal Horn Expressions Authors: R. Khardon Email: roni@dcs.ed.ac.uk Home_page: http://www.dcs.ed.ac.uk/home/roni/ Affiliation: University of Edinburgh Address: Department of Computer Science University of Edinburgh The King's Buildings Edinburgh EH9 3JZ Scotland Thanks_to: Part of this work was done while the author was at Harvard University and supported by ARO grant DAAL03-92-G-0115 and ONR grant N00014-95-1-0550. Abstract: The problem of learning universally quantified function free first order Horn expressions from equivalence and membership queries is studied. The results presented generalise the known algorithm for propositional Horn expressions for several models of learning in first order logic. It is shown that exact learning is possible with membership and equivalence queries with resources polynomial in the number of clauses in the expression, though superpolynomial in the number of universally quantified variables. Similar results for related models including entailment queries and ILP are also derived.

Title: Learning atomic formulas with prescribed properties Authors: I.Tsapara and G.Turan Email: rena@itisolutions.com, gyt@uic.edu Home_page: http://www.math.uic.edu/~tsapara, http://www.math.uic.edu/~turan Affiliation: University of Illinois at Chicago G. Turan is also affiliated with Artificial Intelligence Research Group of the Hungarian Academy of Sciences, Szeged, Hungary. I.Tsapara's current affiliation "Investment Technologies International, Chicago, IL" Address: Department of MSCS, 851 S.Morgan, M/C 249 Chicago, IL, 60607-7045, USA Thanks_to: Partially supported by grants ESPRIT 20237 and OTKA T-016349. Abstract: We consider the learnability of some concept classes in predicate logic with proper equivalence queries. Concepts are represented by atomic formulas over a restricted language. The concept represented by an atomic formula consists of its ground instances having bounded depth. In addition, it is assumed that there is a first-order sentence given, and the concept class contains only those atomic formulas which satisfy this sentence. For instance, one may consider the learnability of atomic formulas that represent symmetric, or transitive concepts. It is shown that every such concept class can be learned with O(m^2 + m log n) queries, where m is the quantifier rank of the sentence and n is the depth bound for the ground instances. The proof uses a combination of tools from logic and learning algorithms for geometric concepts. Model theoretic games are used to determine the structure of the concept classes. We formulate a constrained version of the problem of learning two-dimensional axis-parallel rectangles, where one corner is required to belong to a prespecified subset. A sufficient condition is given for the efficient learnability of a rectangle in terms of the geometric properties of this subset. 
The algorithm for the predicate logic learning problem is obtained by combining the learning algorithm for the constrained rectangle learning problem with the information given by the logical analysis of the structure of the concept class.

Title: Exact Learning of Tree Patterns from Queries and Counterexamples Authors: Thomas R. Amoth and Paul Cull and Prasad Tadepalli Email: {amotht,pc,tadepall}@cs.orst.edu Home_page: http://www.cs.orst.edu/~amotht Affiliation: Oregon State University Address: Department of Computer Science Oregon State University Corvallis, OR 97331 Thanks_to: This research was partially supported by NSF under grant number IRI-9520243. Abstract: We consider learning tree patterns from queries. The instances are ordered and unordered trees with nodes labeled by constant identifiers. The concepts are tree patterns and unions of tree patterns (forests) where all the internal nodes are labeled with constants and the leaves are labeled with constants or variables. A tree pattern matches any tree with its variables replaced with constant subtrees. We show that ordered trees, in which the children are matched in a strict left-to-right order, are exactly learnable from equivalence queries, while ordered forests are learnable from equivalence and membership queries. Unordered trees are exactly learnable from superset queries, and unordered forests are learnable from superset and equivalence queries. Negatively, we also show that each of the query types used is necessary for learning each concept class.

Title: On the Power of Learning Robustly Authors: Sanjay Jain, Carl Smith and Rolf Wiehagen Email: sanjay@iscs.nus.edu.sg, smith@cs.umd.edu, wiehagen@informatik.uni-kl.de Home_page: (1) http://www.iscs.nus.edu.sg/~sanjay, (2) http://www.cs.umd.edu/~smith, (3) http://www-agrw.informatik.uni-kl.de Affiliation: (1) National University of Singapore, (2) University of Maryland, (3) University of Kaiserslautern Address: (1) Department of ISCS, National University of Singapore, Singapore 119260 (2) Department of Computer Science, University of Maryland, College Park, MD 20742, USA (3) Department of Computer Science, University of Kaiserslautern, D-67653 Kaiserslautern, Germany Abstract: Intuitively, a class of objects is robustly learnable if not only the class itself is learnable but all of its computable transformations remain learnable as well. In that sense, being learnable robustly seems to be a desirable property in all fields of learning. We study this phenomenon within the paradigm of inductive inference. Here a class of recursive functions is called robustly learnable under a success criterion I iff all of its images under general recursive operators are learnable under the criterion I. Fulk showed the existence of a non-trivial class which is robustly learnable under the criterion Ex. However, several of the hierarchies (such as the anomaly hierarchies for Ex and Bc) do not stand robustly. Hence, up to now it was not clear whether robust learning is really rich. The main intention of this paper is to give strong evidence that robust learning is rich. Our main result, proved by a priority construction, is that the mind change hierarchy for Ex stands robustly. Moreover, the hierarchies of team learning for both Ex and Bc stand robustly as well.
In several contexts, we observe the surprising fact that a more complex topological structure of the classes to be learned leads to positive robustness results, whereas an easy topological structure yields negative results. We also show the counter-intuitive fact that even some self-referential classes can be learned robustly. Some of our results point out the difficulty of robust learning when only a bounded number of mind changes is allowed. Further results concerning uniformly robust learning are summarized.

Title: Learning One-Variable Pattern Languages in Linear Average Time Authors: R. Reischuk and T. Zeugmann Email: reischuk@informatik.mu-luebeck.de thomas@i.kyushu-u.ac.jp Home_page: http://www.itheoi.mu-luebeck.de/pages/reischuk/ http://www.i.kyushu-u.ac.jp/~thomas/index.html Affiliation: Medical University Luebeck Kyushu University Address: Institut fuer Theoretische Informatik Medizinische Universitaet Luebeck Wallstr. 40 23560 Luebeck, Germany Department of Informatics Kyushu University Kasuga 816-8580, Japan Thanks_to: This research has been supported by the Japan Society for the Promotion of Science under Grant JSPS 29716102 Abstract: A new algorithm for learning one-variable pattern languages is proposed and analyzed with respect to its average-case behavior. We consider the total learning time, which takes into account all operations until the algorithm has converged to a correct hypothesis. For the expectation it is shown that, for almost all meaningful distributions defining how the pattern variable is replaced by a string to generate random examples of the target pattern language, this algorithm converges within a constant number of rounds with a total learning time that is linear in the pattern length. Thus, the algorithm is average-case optimal in a strong sense. Though one-variable pattern languages cannot be inferred finitely, our approach can also be considered as probabilistic finite learning with high confidence.

Title: Large Margin Classification Using the Perceptron Algorithm Authors: Yoav Freund and Robert E. Schapire Email: yoav@research.att.com, schapire@research.att.com Home_page: http://www.research.att.com/~yoav http://www.research.att.com/~schapire Affiliation: AT&T Labs Address: 180 Park Avenue Florham Park, NJ 07932 Abstract: We introduce and analyze a new algorithm for linear classification which combines Rosenblatt's perceptron algorithm with Helmbold and Warmuth's leave-one-out method. Like Vapnik's maximal-margin classifier, our algorithm takes advantage of data that are linearly separable with large margins. Compared to Vapnik's algorithm, however, ours is much simpler to implement, and much more efficient in terms of computation time. We also show that our algorithm can be efficiently used in very high dimensional spaces using kernel functions. We performed some experiments using our algorithm, and some variants of it, for classifying images of handwritten digits. The performance of our algorithm is close to, but not as good as, the performance of maximal-margin classifiers on the same problem.
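The combination of the perceptron with Helmbold and Warmuth's leave-one-out method described above is commonly known as the voted perceptron: every intermediate weight vector votes on the prediction, weighted by how long it survived without a mistake. A minimal unkernelized sketch (function names and the training loop are our own, not taken from the paper):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_voted_perceptron(X, y, epochs=10):
    """Run the perceptron over the data, recording each weight vector
    together with its survival count. Labels must be +1/-1."""
    w = [0.0] * len(X[0])
    votes, c = [], 0
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * dot(w, x) <= 0:       # mistake: store old w, update
                votes.append((list(w), c))
                w = [wi + label * xi for wi, xi in zip(w, x)]
                c = 1
            else:
                c += 1
    votes.append((list(w), c))
    return votes

def predict(votes, x):
    """Each stored perceptron votes with weight equal to its survival count."""
    s = sum(c * (1 if dot(w, x) > 0 else -1) for w, c in votes)
    return 1 if s >= 0 else -1
```

Replacing `dot` with a kernel function gives the high-dimensional variant mentioned in the abstract, since the weight vectors are sums of training examples.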

Title: Cross-Validation for Binary Classification by Real-Valued Functions: Theoretical Analysis Authors: Martin Anthony and Sean B. Holden Email: anthony@vax.lse.ac.uk, s.holden@cs.ucl.ac.uk Home_page: http://www.cs.ucl.ac.uk/staff/S.Holden/ Affiliation: London School of Economics Address: Department of Mathematics, London School of Economics, Houghton Street, London WC2A 2AE, U.K. Affiliation: University College London Address: Department of Computer Science, University College London, Gower Street, London WC1E 6BT, U.K. Thanks_to: Some of this research was carried out while the authors were guests of the Isaac Newton Institute for Mathematical Sciences, Cambridge University. Abstract: This paper concerns the use of real-valued functions for binary classification problems. Previous work in this area has concentrated on using as an error estimate the `resubstitution' error (that is, the empirical error of a classifier on the training sample) or its derivatives. However, in practice, cross-validation and related techniques are more popular. Here, we analyse theoretically the accuracy of the holdout and cross-validation estimators for the case where real-valued functions are used as classifiers. We then introduce two new error estimation techniques, which we call the {\em adaptive holdout estimate\/} and the {\em adaptive cross-validation estimate\/}, and we perform a similar analysis for these. Finally, we show how our results can be applied to certain types of neural network.
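The basic holdout estimator analysed above is simple to state: threshold the real-valued function at zero and count mistakes on examples withheld from training. A minimal sketch (the adaptive variants introduced in the paper are not shown):

```python
def holdout_estimate(f, holdout):
    """Holdout error estimate for a real-valued function f used as a binary
    classifier by thresholding at 0: the fraction of held-out labelled
    examples (x, y), with y in {-1, +1}, that f misclassifies."""
    mistakes = sum(1 for x, y in holdout if y * f(x) <= 0)
    return mistakes / len(holdout)
```

A k-fold cross-validation estimate averages this quantity over k train/holdout splits, retraining f on each split.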

Title: Some PAC-Bayesian theorems Authors: David McAllester Email: dmac@research.att.com Home_page: http://www.research.att.com/~dmac Affiliation: AT&T Labs-Research Address: P.O. Box 971 180 Park Avenue Florham Park, NJ 07932-0971 Abstract: This paper gives PAC guarantees for ``Bayesian'' algorithms --- algorithms that optimize risk minimization expressions involving a prior probability and a likelihood for the training data. PAC-Bayesian algorithms are motivated by a desire to provide an informative prior encoding information about the expected experimental setting while still having PAC performance guarantees over all IID settings. The PAC-Bayesian theorems given here apply to an arbitrary prior measure on an arbitrary concept space. These theorems provide an alternative to the use of VC dimension in proving PAC bounds for parameterized concepts.

Title: Neural Networks and Efficient Associative Memory Authors: M. Miltrup and G. Schnitger Email: miltrup@thi.informatik.uni-frankfurt.de, georg@informatik.uni-frankfurt.de Home_page: http://www.informatik.uni-frankfurt.de/~foued/AG.html Affiliation: University of Frankfurt Address: Computer Science Department, Postfach 11 19 32 60054 Frankfurt, Germany Thanks_to: Partially supported by DFG project Schn 503/1-2 Abstract: We consider the familiarity problem: given n-bit patterns x_1,...,x_m and an n-bit query y, one has to decide whether y is sufficiently close to one of the patterns with respect to Hamming distance. Errors are allowed, and inputs of intermediate distance are excluded when evaluating arbitrary memory models for the familiarity problem. We design neural circuits for various distributions and show that the memory complexity of our circuits (i.e., the logarithm of the total number of required circuits) is close to optimal, even when compared to arbitrary memory models. Moreover, we obtain near-optimal results simultaneously for the resources memory complexity, depth and number of edges. Additionally, the number of vertices can be minimized in a somewhat restricted setting. Circuits for the familiarity problem are considerably smaller in terms of memory complexity, number of edges or number of gates when compared to circuits for the conventional Hamming distance problem.
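As a reference point for the problem statement (not for the paper's neural circuits, which achieve near-optimal memory complexity), the familiarity problem admits a trivial brute-force solution that stores all patterns explicitly:

```python
def familiar(patterns, y, r):
    """Familiarity check: is the n-bit query y within Hamming distance r
    of some stored pattern? Brute-force reference sketch; its memory
    cost is m*n bits, far from the compressed circuits of the paper."""
    def hamming(a, b):
        return sum(ai != bi for ai, bi in zip(a, b))
    return any(hamming(x, y) <= r for x in patterns)
```
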

Title: Self bounding learning algorithms Authors: Yoav Freund Email: yoav@research.att.com Home_page: http://www.research.att.com/~yoav Affiliation: AT&T Labs - Research Address: AT&T Labs - Research 180 Park ave, Room A205 Florham Park, NJ 07932-0971 Abstract: Most of the work which attempts to give bounds on the generalization error of the hypothesis generated by a learning algorithm is based on methods from the theory of uniform convergence. These bounds are a priori bounds that hold for any distribution of examples and are calculated before any data is observed. In this paper we propose a different approach for bounding the generalization error after the data has been observed. A self-bounding learning algorithm is an algorithm which, in addition to the hypothesis that it outputs, outputs a reliable upper bound on the generalization error of this hypothesis. We first explore the idea in the statistical query learning framework of Kearns. After that we give an explicit self-bounding algorithm for learning algorithms that are based on local search.

Title: Sample Complexity of Model-based Search Authors: Christopher D. Rosin Email: crosin@scripps.edu Home_page: http://www.cse.ucsd.edu/users/crosin Affiliation: The Scripps Research Institute and UCSD CSE Dept. Address: The Scripps Research Institute, Mail Drop MB5 10550 North Torrey Pines Road La Jolla, CA 92037 Thanks_to: This work was supported by Burroughs Wellcome LJIS grant APP #0842. Abstract: We consider the problem of searching a domain for points that have a desired property, in the special case where the objective function that determines the properties of points is unknown and must be learned during search. We give a parallel to PAC learning theory that is appropriate for reasoning about the sample complexity of this problem. The learner queries the true objective function at selected points, and uses this information to choose models of the objective function from a given hypothesis class that is known to contain a correct model. These models are used to focus the search on more promising areas of the domain. The goal is to find a point with the desired property in a small number of queries. We define an analog to VC dimension, needle dimension, to be the size of the largest sample in which any single point could have the desired property without the other points' values revealing this information. We give an upper bound on sample complexity that is linear in needle dimension for a natural type of search protocol, and a linear lower bound for a class of constrained problems. We describe the relationship between needle dimension and VC dimension, and give several simple examples.

Title: Testing Problems with Sub-Learning Sample Complexity Authors: Michael Kearns and Dana Ron Email: mkearns@research.att.com, danar@theory.lcs.mit.edu Home_page: http://www.research.att.com/~mkearns, http://theory.lcs.mit.edu/~danar Affiliation: AT&T Labs - Research, MIT Laboratory for Computer Science Address: 180 Park Avenue Florham Park NJ, 545 Technology Square Cambridge MA Thanks_to: The second author is supported by an ONR Science Scholar Fellowship at the Bunting Institute Abstract: We study the problem of determining, for a class of functions $H$, whether an unknown target function $f$ is contained in $H$ or is ``far'' from any function in $H$. Thus, in contrast to problems of {\em learning\/}, where we must {\em construct\/} a good approximation to $f$ in $H$ on the basis of sample data, in problems of {\em testing\/} we are only required to determine the {\em existence\/} of a good approximation. Our main results demonstrate that, over the domain $[0,1]^d$ for constant $d$, the number of examples required for testing grows only as $O(s^{1/2+\delta})$ (where $\delta$ is any small constant), for both decision trees of size $s$ and a special class of neural networks with $s$ hidden units. This is in contrast to the $\Omega(s)$ examples required for learning these same classes. Our tests are based on combinatorial constructions demonstrating that these classes can be approximated by small classes of coarse partitions of space, and rely on repeated application of the well-known Birthday Paradox.

Title: Polylogarithmic-Overhead Piecemeal Graph Exploration Authors: Baruch Awerbuch and Stephen G. Kobourov Email: baruch@cs.jhu.edu, kobourov@cs.jhu.edu Home_page: http://www.cs.jhu.edu/~baruch http://www.cs.jhu.edu/~kobourov Affiliation: Johns Hopkins University Address: Department of Computer Science 3400 N Charles Street Baltimore, MD 21218 Thanks_to: This research was supported by ARPA/Army contract DABT63-93-C-0038, DARPA/Airforce contract F30602961029, NSF Award ID CCR9700157, and ARO contract \#DAAH04-95-1-0607 Abstract: We introduce a new traversal technique in the context of piecemeal exploration of unknown graphs. The problem of learning a graph via piecemeal exploration requires a robot to create a complete map of its environment, subject to two constraints. First, it cannot jump between non-adjacent vertices in one time step and second, it must return to a fixed starting point every so often. This paper presents the recursive piecemeal search (RPS) strategy together with an algorithm for the above problem. We are able to achieve O(log^2 n) overhead (where n is the number of vertices), improving on previous results of Awerbuch, Betke, Rivest, and Singh which require O(n^\epsilon) overhead. The graph is discovered via the recursive piecemeal search, which can be viewed as a combination of breadth-first and depth-first passes. The construction of RPS trees relies on the concept of sparse neighborhood covers and captures nicely the nature of the graph exploration problem.

Title: Projection Learning Author: Leslie G. Valiant Email: valiant@deas.harvard.edu Home_page: http://www.deas.harvard.edu/users/faculty/ Leslie_Valiant/hp.html Affiliation: Harvard University Address: Division of Engineering and Applied Sciences 40 Oxford Street Cambridge, MA 02138 Abstract: A method of combining learning algorithms is described that preserves attribute efficiency. It yields learning algorithms that require a number of examples that is polynomial in the number of relevant variables and logarithmic in the number of irrelevant ones. The algorithms are simple to implement and realizable on networks with a number of nodes linear in the total number of variables. They include several strict generalizations of Littlestone's Winnow algorithm, and are, therefore, good candidates for experimentation on domains having very large numbers of attributes but where nonlinear hypotheses are sought.
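For context, the attribute-efficient base algorithm that the paper's method strictly generalizes is Littlestone's Winnow: weights are updated multiplicatively, so the number of mistakes grows only logarithmically in the number of irrelevant attributes. A minimal sketch of base Winnow (not of projection learning itself; parameter names are ours):

```python
def winnow(examples, n, alpha=2.0):
    """Littlestone's Winnow on boolean vectors of length n: predict 1 iff
    the weighted sum of active attributes reaches the threshold n; on a
    mistake, multiplicatively promote (false negative) or demote (false
    positive) the weights of the active attributes."""
    w = [1.0] * n
    for x, label in examples:
        pred = 1 if sum(w[i] for i in range(n) if x[i]) >= n else 0
        if pred != label:
            factor = alpha if label == 1 else 1.0 / alpha
            for i in range(n):
                if x[i]:
                    w[i] *= factor
    return w
```

On a target depending on k of the n attributes, the multiplicative updates yield mistake bounds of order k log n rather than n, which is the attribute efficiency the abstract refers to.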

Title: The Query Complexity of Finding Local Minima in the Lattice Author1: Amos Beimel Email: beimel@deas.harvard.edu Home_page: http://www.deas.harvard.edu/~beimel Affiliation: Harvard University Address: Division of Engineering & Applied Sciences Harvard University 40 Oxford st., Cambridge, MA 02138. Thanks_to: Supported by grants ONR-M00014-96-1-0550 and ARO-DAAL-03-92-G0115 Author2: Felix Geller Email: felix@cs.technion.ac.il Affiliation: Technion Address: Computer Science Department Technion, Haifa 32000, Israel Author3: Eyal Kushilevitz Email: eyalk@cs.technion.ac.il Home_page: http://www.cs.technion.ac.il/~eyalk Affiliation: Technion Address: Computer Science Department Technion, Haifa 32000, Israel Thanks_to: Supported by Technion V.P.R. Fund 120-872, by Japan Technion Society Research Fund, by the fund for the promotion of research at the Technion, and by the German-Israeli Foundation for scientific research and development (GIF). Abstract: In this paper we study the query complexity of finding local minimum points of a boolean function. This task occurs frequently in the exact learning of many natural classes, such as monotone DNF, O(log n)-term DNF, unate DNF, and decision trees. On the negative side, we prove that any (possibly randomized) algorithm that produces a local minimum of a function f chosen from a sufficiently ``rich'' concept class, using a membership oracle for f, must ask \Omega(n^2) membership queries in the worst case. In particular, this lower bound applies to the class of decision trees. A simple algorithm is known that achieves this lower bound. On the positive side, we show that for the class of O(log n)-term DNF, finding local minimum points requires only \Theta(n log n) membership queries (and more generally \Theta(nt) membership queries for t-term DNF with t < n). This efficient procedure improves the time and query complexity of known learning algorithms for the class of O(log n)-term DNF.
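The "simple algorithm" matching the \Omega(n^2) lower bound is a greedy descent in the boolean lattice: repeatedly try to turn off a 1-bit while keeping f satisfied. A sketch (our own naming; each call to f models one membership query), assuming the all-ones point satisfies f:

```python
def local_minimum(f, n):
    """Greedily descend the boolean lattice to a local minimum of f: a
    point x with f(x)=1 such that turning off any single 1-bit of x
    falsifies f. Restarting the scan after each successful flip gives
    O(n^2) membership queries in the worst case."""
    x = [1] * n
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if x[i] == 1:
                x[i] = 0
                if f(x):          # membership query: descent step allowed
                    changed = True
                    break
                x[i] = 1          # flipping bit i falsified f; restore it
    return x
```

For a monotone function such as f(x) = x_0 AND x_2, the descent ends at its unique minimal satisfying point.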