# COLT 1998 Schedule

### Friday, July 24, 1998

Joint COLT/ICML/UAI Invited Talk (8:30 - 9:30)
Reinforcement Learning: How Far Can It Go?
Rich Sutton

Session 1 (10:00 - 11:40)
On Sequential Prediction of Individual Sequences Relative to a Set of Experts,
Nicolo Cesa-Bianchi, Gabor Lugosi
Universal Portfolio Selection,
Volodya Vovk, Chris Watkins
Tracking the Best Regressor,
Mark Herbster, Manfred Warmuth
Minimax Relative Loss Analysis for Sequential Prediction Algorithms Using Parametric Hypotheses,
Kenji Yamanishi

Session 2 (1:30 - 3:10)
Robust Learning Aided by Context,
John Case, Sanjay Jain, Matthias Ott, Arun Sharma, Frank Stephan
Birds can fly....,
Jochen Nessel
Learnability of a subclass of extended pattern languages,
Andrew Mitchell
Aspects of Complexity of Conservative Probabilistic Inductive Inference,
Lea Meyer

COLT Invited Talk (3:30 - 4:30)
Learning to Communicate via Unknown Channel,
Meir Feder

Session 3 (4:30 - 5:20)
Improved Boosting Algorithms Using Confidence-rated Predictions,
Robert E. Schapire, Yoram Singer
Combining labeled and unlabeled data with co-training,
Avrim Blum, Tom Mitchell

Joint COLT/ICML/ILP/UAI banquet (7:00 - ), Madison Convention Center
Banquet speaker (9:00 - 9:30): David Spiegelhalter
2.5 Millennia of Directed Graphs

### Saturday, July 25, 1998

Joint COLT/ICML/UAI Invited Talk (8:30 - 9:30)
Learning agents for uncertain environments,
Stuart Russell

Session 4 (10:00 - 11:40)
Improved Lower Bounds for Learning from Noisy Examples,
Claudio Gentile, David P. Helmbold
The complexity of learning according to two models of a drifting environment,
Philip M. Long
On the sample complexity of learning functions with bounded variation,
Philip M. Long
Efficient Learning of Monotone Concepts via Quadratic Optimization,
David Gamarnik

Session 5 (1:30 - 3:10)
Structural Results about Exact Learning with Unspecified Attribute Values,
Andreas Birkendorf, Norbert Klasner, Christian Kuhlmann, Hans U. Simon
Learning First Order Universal Horn Expressions,
Roni Khardon
Learning atomic formulas with prescribed properties,
Irene Tsapara, Gyorgy Turan
Exact learning of tree patterns from queries and counterexamples,

Session 6 (3:30 - 4:20)
On the power of learning robustly,
Sanjay Jain, Carl Smith, Rolf Wiehagen
Learning 1-Variable Pattern Languages in Linear Average Time,
Rüdiger Reischuk, Thomas Zeugmann

ICML Invited Talk (4:30 - 5:30)
Crossing the Chasm: From Academic Machine Learning to Commercial Data Mining
Ron Kohavi

Joint COLT/ICML/ILP/UAI Poster Session, Madison Convention Center (7:00 - )
"staffed" 8:00-9:30 if first author's last name starts with A-L
"staffed" 9:30-11:00 if first author's last name starts with M-Z

### Sunday, July 26, 1998

Joint COLT/ICML/UAI Invited Talk (8:30 - 9:30)
Conditional Independence: A Structural Framework for Uncertainty?
Philip Dawid

Session 7 (10:00 - 11:40)
Large Margin Classification Using the Perceptron Algorithm,
Yoav Freund, Robert E. Schapire
Cross-Validation for Binary Classification by Real-Valued Functions,
Martin Anthony, Sean B. Holden
Some PAC-Bayesian Theorems,
David A. McAllester
Neural networks and efficient associative memory,
M. Miltrup, G. Schnitger

Session 8 (1:30 - 3:10)
Self bounding learning algorithms,
Yoav Freund
Sample Complexity of Model-Based Search,
Christopher D. Rosin
Testing Problems with Sub-Learning Sample Complexity,
Michael Kearns, Dana Ron
Baruch Awerbuch, Stephen Kobourov

Session 9 (3:30 - 4:20)
Projection Learning,
Leslie G Valiant
The Query Complexity of Finding Local Minima in the Lattice,
Amos Beimel, Felix Geller, Eyal Kushilevitz

Joint COLT/ICML/UAI Panel Discussion (4:30 - 5:30)
Organizers: Thomas Dietterich, David Heckerman, Michael Kearns (Mills Concert Hall, Humanities Bldg)

## Abstracts

Time: Friday 10:00-10:25

Title:        On Sequential Prediction of Individual Sequences
Relative to a Set of Experts
Authors:      Nicolo Cesa-Bianchi and Gabor Lugosi
Email:        cesabian@dsi.unimi.it, lugosi@upf.es
Home_page:    http://www.dsi.unimi.it/~cesabian,
http://bonvent.upf.es/~lugosi
Affiliation:  University of Milan, Italy,
Pompeu Fabra University, Spain
Abstract:
We investigate sequential randomized prediction of an arbitrary
sequence taking values from a binary alphabet. The goal of the predictor
is to minimize his Hamming loss relative to the loss of the best "expert"
in a fixed, possibly infinite, set of experts. We point out a surprisingly
close connection between the prediction problem and empirical process theory.
Using ideas and results from empirical process theory, we show upper and
lower bounds on the minimax relative loss in terms of the geometry of the
class of experts. As a main example, we determine the exact order of
magnitude of the minimax relative loss for the class of Markov experts.
Furthermore, in the special case of static experts, we completely
characterize the minimax relative loss in terms of the maximal deviation
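The prediction protocol studied here (randomized binary prediction under Hamming loss, competing with the best expert) can be illustrated with the standard exponentially weighted forecaster. The sketch below is only an illustration of the setting, not the minimax strategy analyzed in the paper; the learning rate `eta` and the expert interface are assumptions of this sketch.

```python
import math
import random

def exp_weights_predict(experts, sequence, eta=1.0, seed=0):
    """Randomized prediction of a binary sequence with expert advice.

    experts: functions mapping the history (tuple of past bits) to a
    predicted bit.  Each round an expert is sampled with probability
    proportional to exp(-eta * its cumulative Hamming loss).
    Returns (forecaster's Hamming loss, best expert's Hamming loss).
    """
    rng = random.Random(seed)
    losses = [0.0] * len(experts)
    total_loss = 0
    history = []
    for bit in sequence:
        weights = [math.exp(-eta * l) for l in losses]
        r = rng.uniform(0.0, sum(weights))
        acc, chosen = 0.0, 0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                chosen = i
                break
        advice = [e(tuple(history)) for e in experts]
        total_loss += int(advice[chosen] != bit)   # forecaster's loss
        for i, a in enumerate(advice):             # experts' losses
            losses[i] += int(a != bit)
        history.append(bit)
    return total_loss, min(losses)
```

On an arbitrary sequence, the forecaster's expected loss exceeds the best expert's only by a regret term, which is the quantity whose minimax value the paper studies.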



title:       Universal Portfolio Selection
authors:     V.Vovk and C.Watkins
e-mail:      {vovk,chrisw}@dcs.rhbnc.ac.uk
affiliation: Royal Holloway, University of London
Egham, Surrey TW20 0EX, UK
abstract:
A typical problem in portfolio selection in stock markets is that it
is not clear which of the many available strategies should be used.
We apply a general algorithm of prediction with expert advice (the
Aggregating Algorithm) to two different idealizations of the stock
market.  One is the well-known game introduced by Cover in connection
with his "universal portfolio" algorithm; the other is a more
realistic modification of Cover's game introduced in this paper, where
the market's participants are allowed to take "short positions", so
that the algorithm may be applied to currency and futures markets.
Besides applying the Aggregating Algorithm to a countable (or finite)
family of arbitrary investment strategies, we also apply it, in the
case of Cover's game, to the uncountable family of "constant
rebalanced portfolios" considered by Cover.  We generalize Cover's
worst-case bounds for his "universal portfolio" algorithm (which can
be regarded as a special case of the Aggregating Algorithm
corresponding to learning rate 1) to the case of learning rates not
exceeding 1.  Finally, we discuss a general approach to designing
investment strategies in which, instead of making statistical or other
assumptions, natural assumptions of computability are made; this
approach leads to natural extensions of the notion of Kolmogorov
complexity.
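Cover's game can be made concrete with a small grid approximation to the universal portfolio for a two-asset market. The sketch below uses the standard identity that the universal portfolio's wealth equals the average of the wealths of the constant rebalanced portfolios it mixes over; the grid size and the two-asset, no-shorting restriction are simplifications of this sketch, not the paper's setting.

```python
def universal_portfolio_wealth(price_relatives, grid=20):
    """Grid approximation to Cover's universal portfolio, two assets.

    price_relatives: list of (x1, x2) per-period price relatives.
    A constant rebalanced portfolio (CRP) puts fraction b in asset 1
    every period; the universal portfolio invests the wealth-weighted
    average of b over the grid {0, 1/grid, ..., 1}.
    Returns (universal portfolio wealth, best CRP wealth on the grid).
    """
    bs = [i / grid for i in range(grid + 1)]
    wealths = [1.0] * len(bs)      # wealth of each CRP so far
    total_wealth = 1.0
    for x1, x2 in price_relatives:
        # wealth-weighted average allocation
        b_univ = sum(w * b for w, b in zip(wealths, bs)) / sum(wealths)
        total_wealth *= b_univ * x1 + (1.0 - b_univ) * x2
        wealths = [w * (b * x1 + (1.0 - b) * x2)
                   for w, b in zip(wealths, bs)]
    return total_wealth, max(wealths)
```

On a market that alternates between price relatives (2, 1/2) and (1/2, 2), buy-and-hold in either single asset ends even, while the b = 1/2 CRP, and to a lesser degree the universal portfolio, grows exponentially.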



Title:        Tracking the Best Regressor
Authors:      M. Herbster and M. Warmuth
Email:        mark@cs.ucsc.edu, manfred@cs.ucsc.edu
Home_page:    http://www.cse.ucsc.edu/~mark,
http://www.cse.ucsc.edu/~manfred
Affiliation:  University of California at Santa Cruz
Applied Sciences,
Santa Cruz, California 95064 (USA)
Thanks_to:    This research was supported by grant CCR-9700201
from the National Science Foundation.
Abstract:
In most of the on-line learning research the total on-line loss of the
algorithm is compared to the total loss of the best off-line predictor
$\mathbf{u}$ from a comparison class of predictors.  We call such bounds
static bounds.  The interesting feature of these bounds is that they
hold for an arbitrary sequence of examples.  Recently some work has
been done where the comparison vector $\mathbf{u}_t$ at each trial $t$ is
allowed to change with time, and the total on-line loss of the
algorithm is compared to the sum of the losses of $\mathbf{u}_t$ at each
trial plus the total "cost" for shifting to successive comparison
vectors.  This is to model situations in which the examples change
over time and different predictors from the comparison class are best
for different segments of the sequence of examples.  We call such
bounds shifting bounds.  Shifting bounds still hold for arbitrary
sequences of examples and also for arbitrary partitions. The algorithm
does not know the off-line partition or the sequence of predictors
that its performance is compared against.

Naturally, shifting bounds are much harder to prove.  The only known
bounds are for the case when the comparison class consists of a finite
set of experts or boolean disjunctions. In this paper we develop the
methodology for lifting known static bounds to the shifting case.  In
particular we obtain bounds when the comparison class consists of
linear neurons (linear combinations of experts).  Our essential
technique consists of the following. At the end of each trial we {\em
project} the hypothesis of the static algorithm into a suitably chosen
convex region.  This keeps the hypothesis of the algorithm
well-behaved and the static bounds can be converted to shifting bounds
so that the cost for shifting remains reasonable.
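The projection technique described above can be illustrated in a toy case: online gradient descent for a linear predictor under the square loss, with the hypothesis projected into a box after every trial. The loss, learning rate, and convex region here are assumptions of this sketch; the paper chooses the region so that static bounds convert into shifting bounds, not merely to clamp the weights.

```python
def shifting_gd(examples, lr=0.1, lo=0.0, hi=1.0):
    """Online gradient descent with a per-trial projection step.

    examples: list of (x, y) with x a tuple of floats.  After each
    update the weight vector is projected (clamped) into the box
    [lo, hi]^n, keeping the hypothesis well-behaved.
    Returns (final weights, total square loss of the algorithm).
    """
    n = len(examples[0][0])
    w = [0.0] * n
    total = 0.0
    for x, y in examples:
        yhat = sum(wi * xi for wi, xi in zip(w, x))
        total += (yhat - y) ** 2
        grad = 2.0 * (yhat - y)
        w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
        w = [min(hi, max(lo, wi)) for wi in w]   # projection onto the box
    return w, total
```

Because the hypothesis never leaves the chosen convex region, its distance to any comparison vector inside the region stays controlled even when the comparator shifts.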



Title:   Minimax Relative Loss Analysis for Sequential
Prediction Algorithms Using Parametric Hypotheses
Author:  Kenji Yamanishi
Email:   yamanisi@ccm.cl.nec.co.jp
Affiliation: Theory NEC Laboratory, Real World Computing Partnership,
C&C Media Research Laboratories, NEC Corporation,
Address: 1-1, 4-chome, Miyazaki, Miyamae-ku, Kawasaki,
Kanagawa 216, Japan.
Abstract:
We consider the problem of sequential prediction using a parametric
hypothesis class of real-valued functions, e.g., linear predictors,
piecewise polynomials, etc. We are specifically concerned with
evaluating the minimax relative cumulative loss (RCL), which is
defined as inf_{A} sup_{D} R(A:D).  Here R(A:D) is the difference
between the cumulative loss for an algorithm A w.r.t. a data
sequence D and the minimum cumulative loss over the class w.r.t. D.
The first contribution of this paper is to derive a non-asymptotic
upper bound on the minimax RCL in a general decision-theoretic setting.
It is given by (k/(2B)) \ln m + C, where k is the number of parameters,
m is the sample size, and B and C do not depend on m. We explicitly identify
B and C by relating them to the hypothesis class and the loss function
in a general form. To derive this bound, we propose an algorithm AG2,
of which the key idea is to discretize the continuous parameter space
with optimal size and then to apply Vovk's aggregating technique.
The second contribution of this paper is to derive tight lower bounds
on the minimax RCL. We thereby show that AG2 is optimal in the sense
that its upper bound on RCL matches the lower bound within o(\log m).
The third contribution of this paper is to propose a distributed
variant of AG2, denoted by DAG, by making the learning process parallel.
We prove that under certain constraints for the loss function and
the hypothesis class, DAG suffers the same predictive performance as
AG2 while achieving a significant speed-up of learning over it.



Title:        Robust Learning Aided by Context
Authors:      1) John Case
2) Sanjay Jain
3) Matthias Ott
4) Arun Sharma
5) Frank Stephan
Email:        1) case@cis.udel.edu
2) sanjay@iscs.nus.edu.sg
3) m_ott@ira.uka.de
4) arun@cse.unsw.edu.au
5) fstephan@math.uni-heidelberg.de
Home_page:    1) http://www.cis.udel.edu/~case
2) http://www.iscs.nus.sg/~sanjay/
3) http://i11www.ira.uka.de/~m_ott/
4) http://www.cse.unsw.edu.au/~arun/
5) http://math.uni-heidelberg.de:80/logic/fstephan/fstephan.html
Affiliation:  1) University of Delaware,
                 Newark, DE 19716, USA
              2) Department of Information Systems and Computer Science,
                 National University of Singapore,
                 Singapore 119260, Republic of Singapore
              3) Institut für Logik, Komplexität und Deduktionssysteme,
                 Universität Karlsruhe,
                 76128 Karlsruhe, Germany
              4) School of Computer Science and Engineering,
                 University of New South Wales,
                 Sydney 2052, Australia
              5) Mathematisches Institut,
                 Universität Heidelberg,
                 Im Neuenheimer Feld 294,
                 69120 Heidelberg, Germany
Thanks_to:    This work was carried out while J. Case, S. Jain, M. Ott,
and F. Stephan were visiting the School of Computer Science
and Engineering at the University of New South Wales
3) Supported by the Deutsche Forschungsgemeinschaft (DFG)
Graduiertenkolleg "Beherrschbarkeit komplexer Systeme"
(GRK 209/2-96)
4) Supported by Australian Research Council Grant A49600456
5) Supported by the Deutsche Forschungsgemeinschaft (DFG)
Grant Am 60/9-2
Abstract:
Empirical studies of multitask learning provide some
evidence that the performance of a learning system on its intended
targets improves by presenting to the learning system related tasks,
also called contexts, as additional input. Angluin, Gasarch, and
Smith, as well as Kinber, Smith, Velauthapillai, and Wiehagen
have provided mathematical justification for this phenomenon in the
inductive inference framework.  However, their proofs rely heavily on
self-referential coding tricks, that is, they directly code the
solution of the learning problem into the context. Fulk has shown that
for the EX- and BC-anomaly hierarchies, such results, which rely
on self-referential coding tricks, may not hold robustly. In this work
we analyze robust versions of learning aided by context and show that
--- in contrast to Fulk's result above --- the robust versions
of these learning notions are still very powerful. We also study
the difficulty of the functional dependence between the intended
target tasks and useful associated contexts.



Title:          Birds can fly...
Authors:        Jochen Nessel
Email:          nessel@informatik.uni-kl.de
Affiliation:    University of Kaiserslautern
Postfach 3049
67653 Kaiserslautern
Abstract:
...with a few exceptions. But in real life almost nobody pays much
attention to these exceptions. What we really are interested in is a
practicable rule. In Inductive Inference some measures of complexity are
known, in which the functions of finite support are difficult to learn.
But this class consists only of functions which are "the constant zero
function with a few exceptions."
The following presents a uniform approach to modify three of these
measures in a way that the functions of finite support become easy to
learn, and tries to give intuitive support for the modified models. The
modification is in essence a formal distinction between rules and
exceptions, and is based on the assumption that the complexity of a
learning problem should be determined by the difficulty of learning
the rules, and not by the difficulty of finding the exceptions.



Title:		Learnability of a subclass of extended pattern languages
Authors:	Andrew R. Mitchell
Email:		andrewm@cse.unsw.edu.au
Home_page:	http://www.cse.unsw.edu.au/~andrewm
Affiliation:	University of New South Wales
School of Computer Science and Engineering,
University of New South Wales,
Sydney 2052, Australia
Thanks_to:	This research was supported by an Australian Postgraduate Award
and the Australian Research Council Grant A49600456 to A. Sharma
Abstract:
Angluin introduced the class of pattern languages as term languages
over strings on a finite alphabet, and showed that the class
is identifiable in the limit from texts (positive data). In
Angluin's definition of pattern languages, erasing substitutions are
disallowed. Extended pattern languages are term languages over
strings when erasing substitutions are allowed.  The question of
identifiability in the limit of extended pattern languages from texts
has not been resolved, and is one of the outstanding open
problems in inductive inference.
However, there are at least two subclasses that are
known to be identifiable.  Shinohara has shown that the class of
extended pattern languages that can be generated by regular patterns
(these are patterns in which each variable occurs no more than once)
is identifiable in the limit from texts.  Wright showed that the class
of extended pattern languages where the pattern contains at most m distinct
variables, for any positive integer m, is identifiable in the limit from texts.

The present paper considers a generalisation of regular patterns where
each variable occurring in the pattern occurs exactly $m$ times for
some positive integer $m$.  Such patterns are referred to as
quasi-regular patterns, and  for each m, the corresponding
class of extended pattern languages is denoted QRP_m.  It is shown
that for each m, the class QRP_m is identifiable
in the limit from texts.  The proof employs a sophisticated combinatorial
argument for bounding the search space. It is also shown that the class
of the union of all QRP_m is identifiable in the limit from texts.



Title:        Aspects of Complexity of Conservative Probabilistic Learning
Authors:      Lea Meyer
Email:        lea@modell.iig.uni-freiburg.de
Home_page:    http://www.iig.uni-freiburg.de/modell/users/lea/index.html
Affiliation:  Albert-Ludwigs-Universität Freiburg
Address:      Institut für Informatik und Gesellschaft
Friedrichstr. 50
79098 Freiburg
Abstract:
When dealing with probabilistic learning of indexed families under
monotonicity constraints, we obtain highly structured probabilistic
hierarchies. In particular, conservative probabilistic learning is more
powerful than conservative deterministic learning even if the probability
is close to 1, provided the learning machines are required to work with
respect to proper or class-preserving hypothesis spaces.
The additional learning power of probabilistic learning machines can be
compensated when we allow the machines to ask "questions" to an oracle.
In this paper, we show that conservative probabilistic learning is less
powerful than conservative learning with a K-oracle provided the probability
p is greater than 1/2. However, K is necessary, i.e., for each p>1/2 there
exists a learning problem which is conservatively learnable by a K-oracle
machine but not by an oracle machine having access to an oracle A which has
a lower Turing degree than K.
The main result is that for each oracle A weaker than the halting problem
K, there exists an indexed family L(A) which is conservatively identifiable
with p=1/2, and which exactly reflects the Turing degree of A, i.e., L(A)
is conservatively identifiable by an oracle machine M[B] iff the Turing
degree of A is less than or equal to the Turing degree of B. However, not
every indexed family which is conservatively identifiable with probability
p=1/2 reflects the Turing degree of an oracle. Hence, conservative
probabilistic learning with p=1/2 is "too rich" to be measured in terms
of Turing complexity.



Title:        Learning to Communicate via Unknown Channel
Authors:      Meir Feder
Email:        meir@eng.tau.ac.il
Affiliation:  Electrical Eng. dept., Tel-Aviv University


Title:        Improved Boosting Algorithms Using Confidence-rated Predictions
Authors:      Robert E. Schapire and Yoram Singer
Email:        schapire@research.att.com, singer@research.att.com
Home_page:    http://www.research.att.com/~schapire
http://www.research.att.com/~singer
Affiliation:  AT&T Labs
Florham Park, NJ  07932
Abstract:
We describe several improvements to Freund and
Schapire's AdaBoost boosting algorithm, particularly in a setting in
which hypotheses may assign confidences to each of their predictions.
We give a simplified analysis of AdaBoost in this setting, and we show
how this analysis can be used to find improved parameter settings as
well as a refined criterion for training weak hypotheses.  We give a
specific method for assigning confidences to the predictions of
decision trees, a method closely related to one used by Quinlan.  This
method also suggests a technique for growing decision trees which
turns out to be identical to one proposed by Kearns and Mansour.

We focus next on how to apply the new boosting algorithms to
multiclass classification problems, particularly to the multi-label
case in which each example may belong to more than one class. We give
two boosting methods for this problem.  One of these leads to a new
method for handling the single-label case which is simpler but as
effective as techniques suggested by Freund and Schapire.  Finally, we
give some experimental results comparing a few of the algorithms
discussed in this paper.
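The confidence-rated setting can be sketched as follows: weak hypotheses return real values whose sign is the prediction and whose magnitude is the confidence, and each round greedily picks the hypothesis minimizing the normalization factor Z, in the spirit of the refined training criterion mentioned above. The fixed pool of candidate hypotheses and labels in {-1, +1} are simplifying assumptions of this sketch.

```python
import math

def boost_confidence_rated(examples, candidates, rounds=10):
    """Boosting with confidence-rated predictions (sketch).

    examples: list of (x, y) with y in {-1, +1}.  candidates: functions
    x -> real; sign is the prediction, magnitude the confidence.  Each
    round the candidate minimizing Z = sum_i w_i exp(-y_i h(x_i)) is
    added, and weights are updated by w_i <- w_i exp(-y_i h(x_i)) / Z.
    Returns the combined classifier sign(sum of chosen hypotheses).
    """
    m = len(examples)
    w = [1.0 / m] * m
    ensemble = []
    for _ in range(rounds):
        best_h, best_z = None, float("inf")
        for h in candidates:
            z = sum(wi * math.exp(-y * h(x))
                    for wi, (x, y) in zip(w, examples))
            if z < best_z:
                best_h, best_z = h, z
        ensemble.append(best_h)
        w = [wi * math.exp(-y * best_h(x)) / best_z
             for wi, (x, y) in zip(w, examples)]

    def predict(x):
        return 1 if sum(h(x) for h in ensemble) >= 0 else -1
    return predict
```

Minimizing Z each round directly drives down the standard exponential bound on the training error of the combined classifier.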



Title:        Combining Labeled and Unlabeled Data with Co-Training
Authors:      Avrim Blum and Tom Mitchell
Email:        avrim+@cs.cmu.edu, mitchell+@cs.cmu.edu
Home_page:    http://www.cs.cmu.edu/~avrim
http://www.cs.cmu.edu/~tom
Affiliation:  Carnegie Mellon University
Carnegie Mellon University,
Pittsburgh PA 15213-3891
Thanks_to:    This research was supported in part by the DARPA HPKB
program under contract F30602-97-1-0215 and by NSF National
Young Investigator grant CCR-9357793

Abstract:
We consider the problem of using a large unlabeled sample to boost
performance of a learning algorithm when only a small set of labeled
examples is available.  In particular, we consider a problem setting
motivated by the task of learning to classify web pages, in which the
description of each example can be partitioned into two distinct
views.  For example, the description of a web page can be partitioned
into the words occurring on that page, and the words occurring in
hyperlinks that point to that page.  We assume that either view of the
example would be sufficient for learning if we had enough labeled
data, but our goal is to use both views together to allow inexpensive
unlabeled data to augment a much smaller set of labeled
examples.  Specifically, the presence of two distinct views of each
example suggests strategies in which two learning algorithms are
trained separately on each view, and then each algorithm's predictions
on new unlabeled examples are used to enlarge the training set of the
other.  Our goal in this paper is to provide a PAC-style analysis for
this setting, and, more broadly, a PAC-style framework for the general
problem of learning from both labeled and unlabeled data.  We also
provide empirical results on real web-page data indicating that this
use of unlabeled examples can lead to significant improvement of
hypotheses in practice.
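The strategy sketched in the abstract, two learners each enlarging the other's training set, can be outlined as below. The `train` interface (returning a prediction function and a confidence function) and the confidence-ranked selection are assumptions of this sketch; the paper's experiments use specific learners and selection sizes.

```python
def co_train(labeled, unlabeled, train, rounds=5, per_round=2):
    """Co-training sketch over two-view examples.

    labeled: list of ((view1, view2), y).  unlabeled: list of
    (view1, view2).  train(data) -> (predict, confidence) is a
    user-supplied single-view learner.  Each round, each view's
    classifier labels its most confident unlabeled examples and adds
    them to the shared labeled pool.  Returns the enlarged pool.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # retrain one classifier per view on the current labeled pool
        h = [train([(x[v], y) for x, y in labeled]) for v in (0, 1)]
        for v in (0, 1):
            predict, confidence = h[v]
            pool.sort(key=lambda x: -confidence(x[v]))
            for x in pool[:per_round]:
                labeled.append((x, predict(x[v])))
            pool = pool[per_round:]
    return labeled
```

The point of the two-view assumption is visible in the loop: an example confidently labeled via one view becomes a training example for the classifier on the other view.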



Title:        Learning agents for uncertain environments
Authors:      Stuart Russell
Email:        russell@cs.berkeley.edu
Affiliation:  Computer Science Dept., University of California, Berkeley.


Title:       Improved Lower Bounds for Learning from Noisy Examples:
an Information-Theoretic Approach
Authors:     Claudio Gentile, David P. Helmbold
Email:       gentile@dsi.unimi.it, dph@cse.ucsc.edu
Affiliation: Universita' di Milano,
University of California Santa Cruz
Address:     DSI, Via Comelico 39, 20135 Milano, Italy,
Computer Science Department, 95064 Santa Cruz, USA
Thanks_to:   The second author is supported by NSF grant CCR 9700201

Abstract:
This paper presents
a general information-theoretic approach for obtaining lower bounds
on the number of examples needed to PAC learn in the presence
of noise.
This approach deals directly with the fundamental information
quantities, avoiding a Bayesian analysis.
The technique is applied to several different models,
illustrating its generality and power.
The resulting bounds add logarithmic factors to (or improve
the constants in) previously known lower bounds.



Title: The complexity of learning according to two models of a drifting environment
Authors: P. M. Long
Email: plong@iscs.nus.edu.sg
Home_page: http://www.iscs.nus.edu.sg/~plong
Affiliation: National University of Singapore
National University of Singapore
Singapore 119260, Republic of Singapore
Abstract:
We establish new sufficient conditions for learning in a drifting environment
for both the agnostic and the realizable cases.  Both match known necessary
conditions to within constant factors.  We provide a simpler proof of a
bound on the sample complexity of agnostic learning in a fixed environment
that is within a constant factor of the best possible.



Title:        On the sample complexity of learning functions with bounded variation
Authors:      P. M. Long
Email:        plong@iscs.nus.edu.sg
Home_page:    http://www.iscs.nus.edu.sg/~plong
Affiliation:  National University of Singapore
National University of Singapore
Singapore 119260, Republic of Singapore
Thanks_to:    This research was supported by National University of Singapore Academic Research Fund Grant RP960625.
Abstract:
We show that the class BV of [0,1]-valued functions with total
variation at most 1 can be agnostically learned with respect to the
absolute loss in polynomial time from O((1/epsilon)^2 log (1/delta))
examples, matching a known lower bound to within a constant factor.
We establish a bound of O(1/m) on the expected error of a
polynomial-time algorithm for learning BV in the prediction model,
also matching a known lower bound to within a constant factor.
Applying a known algorithm transformation to our prediction algorithm,
we obtain a polynomial-time PAC learning algorithm for BV with a
sample complexity bound of O((1/epsilon)log(1/delta)); this also
matches a known lower bound to within a constant factor.



Title: 		Efficient Learning of Monotone Concepts via Quadratic Optimization
Authors:  	David Gamarnik
Email:		gamarnik@watson.ibm.com
Affiliation:  IBM, T. J. Watson Research Center
Address:      IBM, T. J. Watson Research Center
P.O. Box 218,
Yorktown Heights, NY 10598



Title:		Structural Results about Exact Learning with
Unspecified Attribute Values
Authors:	Andreas Birkendorf, Norbert Klasner,
Christian Kuhlmann, and Hans U. Simon
Email:		birkendo@lmi.ruhr-uni-bochum.de,
klasner@lmi.ruhr-uni-bochum.de,
kuhlmann@lmi.ruhr-uni-bochum.de,
simon@lmi.ruhr-uni-bochum.de
Home_page:	http://www.lmi.ruhr-uni-bochum.de/birkendorf/,
http://www.lmi.ruhr-uni-bochum.de/klasner/,
http://www.lmi.ruhr-uni-bochum.de/kuhlmann/,
http://www.lmi.ruhr-uni-bochum.de/simon/
Affiliation:	Ruhr-Universitaet Bochum
Fakultaet fuer Mathematik,
D-44780 Bochum,
Germany
Thanks_to:	The authors gratefully acknowledge the support of
Deutsche Forschungsgemeinschaft grant Si 498/3-2 and
the support of the German-Israeli Foundation for
Scientific Research and Development grant
I-403-001.06/95.
Abstract:
This paper deals with the UAV learning model of Goldman,
Kwek and Scott [GoldmanKS97], where "UAV" is the acronym for
"Unspecified Attribute Values".  As in [GoldmanKS97], we consider
exact learning within the UAV framework.  A smooth transition between
exact learning in the UAV setting and standard exact learning is
obtained by putting a fixed bound r on the number of unspecified
attribute values per instance.  For r=0, we obtain the standard
model.  For r=n (the total number of attributes), we obtain the
(unrestricted) UAV model.  Between these extremes, we find the
hierarchies (UAV-MQ_r)_{0 <= r <= n}, (UAV-EQ_r)_{0 <= r <= n}, and
(UAV-ARB-EQ_r)_{0 <= r <= n}.

Our main results are as follows.  We present various lower
bounds on the number of ARB-EQs and UAV-MQs in terms of the Vapnik
Chervonenkis dimension of the concept class.  We show furthermore that
a natural extension of Angluin's Sunflower Lemma [Angluin88] is still
applicable in the exact UAV learning model.  Our UAV Sunflower Lemma
allows us to establish exponentially large lower bounds on the necessary
number of UAV-MQs for several popular concept classes.  On the other
hand, we can show that slight simplifications of these classes are
efficiently learnable using only a few UAV-MQs.  Finally, we investigate
the inherent structure of the aforementioned three hierarchies and the
relations between them.  It turns out that query type UAV-EQ_{r-1} is
strictly stronger than UAV-EQ_r (for each constant r).  The analogous
result for UAV-ARB-EQ is valid.  Furthermore, UAV-MQ_{r+omega(log n)}
is strictly stronger than UAV-MQ_r.  We also determine the relation
between query types chosen from different hierarchies.



Title:        Learning First Order Universal Horn Expressions
Authors:      R. Khardon
Email:        roni@dcs.ed.ac.uk

Home_page:    http://www.dcs.ed.ac.uk/home/roni/

Affiliation:  University of Edinburgh
University of Edinburgh
The King's Buildings
Edinburgh EH9 3JZ
Scotland

Thanks_to:    Part of this work was done while the author was at
Harvard University and supported by ARO grant
DAAL03-92-G-0115 and ONR grant N00014-95-1-0550.

Abstract:
The problem of learning universally quantified function free
first order Horn expressions from equivalence and membership queries
is studied.  The results presented generalise the known algorithm for
propositional Horn expressions for several models of learning in first
order logic.  It is shown that exact learning is possible with
membership and equivalence queries with resources polynomial in the
number of clauses in the expression, though superpolynomial in the
number of universally quantified variables.  Similar results for
related models including entailment queries and ILP are also derived.



Title:      Learning atomic formulas with prescribed properties
Authors:      I.Tsapara and G.Turan
Email:        rena@itisolutions.com, gyt@uic.edu
Home_page:   http://www.math.uic.edu/~tsapara,
http://www.math.uic.edu/~turan
Affiliation:  	University of Illinois at Chicago
G. Turan is also affiliated with the Artificial Intelligence Research
Group of the Hungarian Academy of Sciences, Szeged, Hungary.
I. Tsapara's current affiliation: Investment Technologies International,
Chicago, IL.
851 S.Morgan, M/C 249
Chicago, IL, 60607-7045, USA
Thanks_to:   Partially supported by grants ESPRIT 20237 and
OTKA T-016349.

Abstract:
We consider the learnability of some concept classes in predicate
logic with proper equivalence queries. Concepts are represented by atomic
formulas over a restricted language. The concept represented by an atomic
formula consists of its ground instances having bounded depth.
In addition, it is assumed that there is a first-order sentence given, and
the concept class contains only those atomic formulas which satisfy this
sentence.
For instance, one may consider the learnability of atomic formulas that
represent symmetric, or transitive concepts.
It is shown that every such concept class can be learned with O(m^2 + m log
n) queries, where m is the quantifier rank of the sentence and n is the
depth bound for the ground instances. The proof uses a combination of tools
from logic and learning algorithms for geometric concepts. Model theoretic
games are used to determine the structure of the concept classes. We
formulate a constrained version of the problem of learning two-dimensional
axis-parallel rectangles,
where one corner is required to belong to a prespecified subset. A
sufficient condition is given for the efficient learnability of a rectangle
in terms of the geometric properties of this subset.
The algorithm for the predicate logic learning problem is obtained by
combining the learning algorithm for the constrained rectangle learning
problem with the information given by the logical analysis of the
structure of the concept class.



Title:        Exact Learning of Tree Patterns from Queries and
Counterexamples
Home_page:    http://www.cs.orst.edu/~amotht
Affiliation:  Oregon State University
Oregon State University
Corvallis, OR 97331
Thanks_to:    This research was partially supported by NSF under grant
number IRI-9520243.
Abstract:
We consider learning tree patterns from queries. The instances
are ordered and unordered trees with nodes labeled by constant identifiers.
The concepts are tree patterns and unions of tree patterns (forests)
where all the internal nodes are labeled
with constants and the leaves are labeled with constants or variables. A
tree pattern matches any tree with its variables replaced
with constant subtrees. We show that ordered trees, in which the children
are matched in a strict left-to-right order, are exactly learnable from
equivalence queries, while ordered forests are learnable from equivalence
and membership queries. Unordered trees are exactly learnable from superset
queries, and unordered forests are learnable from superset and
equivalence queries. On the negative side, we also show that each of the
query types used is necessary for learning each concept class.



Title:       On the Power of Learning Robustly
Authors:     Sanjay Jain, Carl Smith and Rolf Wiehagen
Email:       sanjay@iscs.nus.edu.sg, smith@cs.umd.edu, wiehagen@informatik.uni-kl.de
Home_page:   (1) http://www.iscs.nus.edu.sg/~sanjay,
(2) http://www.cs.umd.edu/~smith,
(3) http://www-agrw.informatik.uni-kl.de
Affiliation: (1) National University of Singapore,
(2) University of Maryland,
(3) University of Kaiserslautern
(1) National University of Singapore,
Singapore 119260
(2) Department of Computer Science,
University of Maryland,
College Park, MD 20742,
USA
(3) Department of Computer Science,
University of Kaiserslautern,
D-67653 Kaiserslautern,
Germany
Thanks_to:
Abstract:
Intuitively, a class of objects is robustly learnable if not only
the class itself is learnable but all of its computable
transformations remain learnable as well. In that sense, robust
learnability seems to be a desirable property in all fields of learning.
We will study this phenomenon within the paradigm of inductive inference.
Here a class of recursive functions is called robustly learnable under a
success criterion I iff all of its images under general recursive
operators are learnable under the criterion I. Fulk showed the existence
of a non-trivial class which is robustly learnable under the criterion
Ex. However, several of the hierarchies (such as the anomaly hierarchies
for Ex and Bc) do not stand robustly. Hence, up to now it was not
clear if robust learning is really rich. The main intention of this
paper is to give strong evidence that robust learning is rich.
Our main result, proved by a priority construction, is that the mind
change hierarchy for Ex stands robustly. Moreover, the hierarchies
of team learning for both Ex and Bc stand robustly as well.
In several contexts, we observe the surprising fact that a more
complex topological structure of the classes to be learned leads
to positive robustness results, whereas an easy topological structure
yields negative results. We also show the counter-intuitive fact
that even some self-referential classes can be learned robustly.
Some of our results point out the difficulty of robust learning when only
a bounded number of mind changes is allowed.
Further results concerning uniformly robust learning are summarized.



Title:        Learning One-Variable Pattern Languages in Linear Average Time
Authors:      R. Reischuk and T. Zeugmann
Email:        reischuk@informatik.mu-luebeck.de
thomas@i.kyushu-u.ac.jp
Home_page:    http://www.itheoi.mu-luebeck.de/pages/reischuk/
http://www.i.kyushu-u.ac.jp/~thomas/index.html
Affiliation:  Medical University Luebeck
Kyushu University
Medizinische Universitaet Luebeck
Wallstr. 40
23560 Luebeck, Germany

Department of Informatics
Kyushu University
Kasuga 816-8580, Japan

Thanks_to:    This research has been supported by the Japan Society for
the Promotion of Science under Grant JSPS 29716102

Abstract:
A new algorithm for learning one-variable pattern languages is
proposed and analyzed with respect to its average-case behavior.
We consider the total learning time, which takes into account all
operations until the algorithm has converged to a correct hypothesis.
For the expectation it is shown that, for almost all meaningful
distributions defining how the pattern variable is replaced by a string
to generate random examples of the target pattern language, this
algorithm converges within a constant number of rounds with a total
learning time that is linear in the pattern length.
Thus, the algorithm is average-case optimal in a strong sense.

Though one-variable pattern languages cannot be inferred finitely, our
approach can also be considered as probabilistic finite learning with high
confidence.
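For readers unfamiliar with the class: a one-variable pattern generates exactly the strings obtained by substituting one fixed non-empty string for every occurrence of its variable. A membership check under that standard (non-erasing) semantics is short; this is our own illustration of the language class, not the paper's learning algorithm.

```python
def generates(pattern, w, var='x'):
    """Does the one-variable `pattern` generate string `w`, i.e. is there
    one non-empty substitution used at every occurrence of `var`?"""
    parts = pattern.split(var)          # constant segments between occurrences
    k = len(parts) - 1                  # number of variable occurrences
    if k == 0:
        return pattern == w             # no variable: exact match only
    rem = len(w) - sum(len(p) for p in parts)
    if rem <= 0 or rem % k:             # substitution length must be a
        return False                    # positive integer
    m = rem // k
    # the substitution is forced by the position of the first occurrence
    s = w[len(parts[0]):len(parts[0]) + m]
    return s.join(parts) == w
```

For instance, the pattern 0x1x generates 0ab1ab (with x = ab) but not 0a1b, since the two occurrences of x would have to differ.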




Title:        Large Margin Classification Using the Perceptron Algorithm
Authors:      Yoav Freund and Robert E. Schapire
Email:        yoav@research.att.com, schapire@research.att.com
Home_page:    http://www.research.att.com/~yoav
http://www.research.att.com/~schapire
Affiliation:  AT&T Labs
Florham Park, NJ  07932
Abstract:
We introduce and analyze a new algorithm for linear
classification which combines Rosenblatt's perceptron algorithm with
Helmbold and Warmuth's leave-one-out method.  Like Vapnik's
maximal-margin classifier, our algorithm takes advantage of data that
are linearly separable with large margins.  Compared to Vapnik's
algorithm, however, ours is much simpler to implement, and much more
efficient in terms of computation time.  We also show that our
algorithm can be efficiently used in very high dimensional spaces
using kernel functions.  We performed some experiments using our
algorithm, and some variants of it, for classifying images of
handwritten digits.  The performance of our algorithm is close to, but
not as good as, the performance of maximal-margin classifiers on the
same problem.
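For context, the classical kernelized perceptron underlying this line of work can be sketched in a few lines. This is a plain sketch, not the authors' leave-one-out/voted variant; the degree-2 polynomial kernel and the epoch count are our illustrative choices.

```python
def kernel_perceptron(examples, kernel, epochs=10):
    """Train on (x, y) pairs with y in {-1, +1}; returns the list of
    mistake examples, which serve as the kernel expansion's support set."""
    mistakes = []                       # (x_i, y_i) pairs where updates occurred
    for _ in range(epochs):
        for x, y in examples:
            # prediction is the sign of the kernel expansion over mistakes
            score = sum(yi * kernel(xi, x) for xi, yi in mistakes)
            if y * score <= 0:          # mistake: add x to the expansion
                mistakes.append((x, y))
    return mistakes

def predict(mistakes, kernel, x):
    score = sum(yi * kernel(xi, x) for xi, yi in mistakes)
    return 1 if score > 0 else -1

# an example kernel: inhomogeneous polynomial of degree 2
poly2 = lambda a, b: (1 + sum(ai * bi for ai, bi in zip(a, b))) ** 2
```

Because only kernel evaluations against past mistakes are needed, the algorithm runs in very high dimensional feature spaces without ever forming feature vectors explicitly, which is the efficiency advantage over computing a maximal-margin separator directly.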




Title:        Cross-Validation for Binary Classification by Real-Valued
Functions: Theoretical Analysis
Authors:      Martin Anthony and Sean B. Holden
Email:        anthony@vax.lse.ac.uk, s.holden@cs.ucl.ac.uk
Home_page:    http://www.cs.ucl.ac.uk/staff/S.Holden/
Affiliation:  London School of Economics
London School of Economics,
Houghton Street,
London WC2A 2AE, U.K.
Affiliation:  University College London
University College London,
Gower Street,
London WC1E 6BT, U.K.
Thanks_to:    Some of this research was carried out while the authors were
guests of the Isaac Newton Institute for Mathematical Sciences,
Cambridge University.
Abstract:
This paper concerns the use of real-valued functions for binary
classification problems. Previous work in this area has concentrated
on using as an error estimate the `resubstitution' error (that is, the
empirical error of a classifier on the training sample) or its
derivatives. However, in practice, cross-validation and related
techniques are more popular. Here, we analyse theoretically the
accuracy of the holdout and cross-validation estimators for the case
where real-valued functions are used as classifiers. We then introduce
two new error estimation techniques, which we call the {\em adaptive
holdout estimate\/} and the {\em adaptive cross-validation
estimate\/}, and we perform a similar analysis for these. Finally, we
show how our results can be applied to certain types of neural
network.
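The plain holdout estimator analysed here can be stated in a few lines. In this sketch the threshold at zero and the {-1, +1} labels are our illustrative conventions; the paper's adaptive estimators refine this basic count.

```python
def holdout_error(h, holdout):
    """Holdout error estimate: the fraction of held-out (x, y) pairs,
    y in {-1, +1}, misclassified by thresholding the real-valued
    classifier h at zero."""
    mistakes = sum(1 for x, y in holdout
                   if (1 if h(x) > 0 else -1) != y)
    return mistakes / len(holdout)
```

Cross-validation applies the same count repeatedly, each time holding out a different fold of the sample and training on the rest, then averages the resulting estimates.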


Title:        Some PAC-Bayesian theorems
Authors:      David McAllester
Email:        dmac@research.att.com
Home_page:    http://www.research.att.com/~dmac
Affiliation:  AT&T Labs-Research
180 Park Avenue
Florham Park, NJ 07932-0971
Abstract:
This paper gives PAC guarantees for ``Bayesian'' algorithms ---
algorithms that optimize risk minimization expressions involving a
prior probability and a likelihood for the training data.
PAC-Bayesian algorithms are motivated by a desire to provide an
informative prior encoding information about the expected experimental
setting but still having PAC performance guarantees over all IID
settings.  The PAC-Bayesian theorems given here apply to an arbitrary
prior measure on an arbitrary concept space.  These theorems provide
an alternative to the use of VC dimension in proving PAC bounds for
parameterized concepts.



Title:        Neural Networks and Efficient Associative Memory
Authors:      M. Miltrup and G. Schnitger
Email:        miltrup@thi.informatik.uni-frankfurt.de,
georg@informatik.uni-frankfurt.de,
Home_page:    http://www.informatik.uni-frankfurt.de/~foued/AG.html
Affiliation:  University of Frankfurt
Postfach 11 19 32
60054 Frankfurt, Germany
Thanks_to:    Partially supported by DFG project Schn 503/1-2
Abstract:     We consider the familiarity problem:  for given n-bit
patterns x_1,...,x_m and an n-bit query y, one has to
decide whether y is sufficiently close to one of the patterns
with respect to Hamming distance.  Errors are allowed, and inputs of
intermediate distance are excluded when evaluating arbitrary memory
models for the familiarity problem.
various distributions and show that the memory complexity of our circuits
(i.e., the logarithm of the total number of required circuits) is close
to optimal, even when compared to arbitrary memory models.  Moreover
we obtain near-optimal results simultaneously for the resources
memory complexity, depth and number of edges. Additionally the number
of vertices can be minimized in a somewhat restricted setting.
Circuits for the familiarity problem are considerably smaller
in terms of memory complexity, number of edges or number of gates
when compared to circuits for the conventional Hamming distance problem.
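The brute-force baseline against which such circuits are measured stores all m patterns verbatim, for a memory cost of m*n bits; the point of the construction above is to approach the information-theoretic optimum instead. A minimal sketch of that baseline:

```python
def hamming(x, y):
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(x, y))

def familiar(patterns, query, radius):
    """Naive familiarity check: is `query` within Hamming distance
    `radius` of some stored pattern?  Memory cost: m*n bits."""
    return any(hamming(p, query) <= radius for p in patterns)
```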


Title:        Self bounding learning algorithms
Authors:      Yoav Freund
Email:        yoav@research.att.com
Home_page:    http://www.research.att.com/~yoav
Affiliation:  AT&T Labs - Research
180 Park ave, Room A205
Florham Park, NJ 07932-0971
Abstract:
Most of the work which attempts to give bounds on the generalization
error of the hypothesis generated by a learning algorithm is based on
methods from the theory of uniform convergence. These bounds are
a-priori bounds that hold for any distribution of examples and are
calculated before any data is observed. In this paper we propose a
different approach for bounding the generalization error after the data
has been observed. A self-bounding learning algorithm is an algorithm
which outputs, in addition to a hypothesis, a reliable upper bound on
the generalization error of that hypothesis.
We first explore the idea in the statistical query learning framework
of Kearns. After that we give an explicit self-bounding
algorithm for learning algorithms that are based on local search.



Title:        Sample Complexity of Model-based Search
Authors:      Christopher D. Rosin
Email:        crosin@scripps.edu
Home_page:    http://www.cse.ucsd.edu/users/crosin
Affiliation:  The Scripps Research Institute and UCSD CSE Dept.
Address:      The Scripps Research Institute, Mail Drop MB5
La Jolla, CA 92037
Thanks_to:    This work was supported by Burroughs Wellcome LJIS
grant APP #0842.
Abstract:
We consider the problem of searching a domain for points that have a
desired property, in the special case where the objective function
that determines the properties of points is unknown and must be
learned during search.  We give a parallel to PAC learning theory that
is appropriate for reasoning about the sample complexity of this
problem.  The learner queries the true objective function at selected
points, and uses this information to choose models of the objective
function from a given hypothesis class that is known to contain a
correct model.  These models are used to focus the search on more
promising areas of the domain.  The goal is to find a point with the
desired property in a small number of queries.  We define an analog of
VC dimension, the {\em needle dimension\/}, to be the size of the largest
sample in which any single point could have the desired property
without the other points' values revealing this information.  We give
an upper bound on sample complexity that is linear in needle dimension
for a natural type of search protocol, and a linear lower bound for a
class of constrained problems.  We describe the relationship between
needle dimension and VC dimension, and give several simple examples.



Title:        Testing Problems with Sub-Learning Sample Complexity
Authors:      Michael Kearns and Dana Ron
Email:        mkearns@research.att.com,  danar@theory.lcs.mit.edu
Home_page:    http://www.research.att.com/~mkearns,
http://theory.lcs.mit.edu/~danar
Affiliation:  AT&T Labs - Research,
MIT Laboratory for Computer Science
Address:      180 Park Avenue Florham Park NJ,
545 Technology Square Cambridge MA
Thanks_to:    The second author is supported by an ONR Science Scholar
Fellowship at the Bunting Institute
Abstract:
We study the problem of determining, for a class of functions $H$,
whether an unknown target function $f$ is contained in $H$ or is
``far'' from any function in $H$. Thus, in contrast to problems of
{\em learning\/}, where we must {\em construct\/} a good approximation
to $f$ in $H$ on the basis of sample data, in problems of {\em testing\/}
we are only required to determine the {\em existence\/} of a good
approximation. Our main results demonstrate that, over the domain
$[0,1]^d$ for constant $d$, the number of examples required for testing
grows only as $O(s^{1/2+\delta})$ (where $\delta$ is any small constant),
for both decision trees of size $s$ and a special class of neural networks
with $s$ hidden units.
This is in contrast to the $\Omega(s)$ examples required for learning
these same classes. Our tests are based on combinatorial constructions
demonstrating that these classes can be approximated by small classes of
coarse partitions of space, and rely on repeated application of the



Title:        Polylogarithmic-Overhead Piecemeal Graph Exploration
Authors:      Baruch Awerbuch and Stephen G. Kobourov
Email:        baruch@cs.jhu.edu, kobourov@cs.jhu.edu
Home_page:    http://www.cs.jhu.edu/~baruch
http://www.cs.jhu.edu/~kobourov
Affiliation:  Johns Hopkins University
3400 N Charles Street
Baltimore, MD 21218
Thanks_to:    This research was supported by ARPA/Army contract
DABT63-93-C-0038, DARPA/Airforce contract F30602961029, NSF Award ID
CCR9700157, and  ARO contract \#DAAH04-95-1-0607
Abstract:
We introduce a new traversal technique in the context of
piecemeal exploration of unknown graphs. The problem of learning a
graph via piecemeal exploration requires a robot to create a
complete map of its environment, subject to two constraints.
First, it cannot jump between non-adjacent vertices in one time
step and second, it must return to a fixed starting point every so
often. This paper presents the recursive piecemeal search (RPS)
strategy together with an algorithm for the above problem. We are
able to achieve O(log^2 n) overhead (where n is the number of
vertices), improving on previous results of Awerbuch, Betke,
Rivest, and Singh which require O(n^\epsilon) overhead. The
graph is discovered via the recursive piecemeal search, which can
be viewed as a combination of breadth-first and depth-first passes. The
construction of RPS trees relies on the concept of sparse
neighborhood covers and captures nicely the nature of the graph
exploration problem.



Title: 	       Projection Learning
Author:        Leslie G. Valiant
Email:         valiant@deas.harvard.edu
Home_page:     http://www.deas.harvard.edu/users/faculty/Leslie_Valiant/hp.html
Affiliation:   Harvard University
Address:       Division of Engineering and Applied Sciences
40 Oxford Street
Cambridge, MA  02138

Abstract:
A method of combining learning algorithms is described that preserves
attribute efficiency.   It yields learning algorithms  that
require a number of examples that is polynomial in the number
of relevant variables and logarithmic in the number of irrelevant
ones.  The algorithms are simple to implement and realizable on
networks with a number of nodes linear in the total number of
variables.  They include several  strict generalizations of
Littlestone's Winnow algorithm, and are, therefore, good candidates
for experimentation on  domains
having very large numbers of attributes but where nonlinear hypotheses
are sought.
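For reference, a minimal sketch of Littlestone's Winnow algorithm, the base algorithm that projection learning strictly generalizes. The threshold n/2 and update factor 2 below are the conventional textbook choices, not parameters taken from this paper.

```python
def winnow(examples, n, threshold=None, alpha=2.0):
    """Online learning of a monotone disjunction over n boolean
    attributes. `examples` yields (x, y) with x a 0/1 tuple and y in
    {0, 1}. Multiplicative updates give a mistake bound logarithmic in
    the number of irrelevant attributes."""
    theta = threshold if threshold is not None else n / 2
    w = [1.0] * n
    for x, y in examples:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
        if pred == 0 and y == 1:        # promotion: double active weights
            w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
        elif pred == 1 and y == 0:      # demotion: halve active weights
            w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w, theta
```

After a few passes over labeled examples of a disjunction, the weights of the relevant attributes grow past the threshold while irrelevant weights shrink, which is the attribute-efficiency property the paper's combinations preserve.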



Title:       The Query Complexity of Finding Local Minima in the Lattice

Author1:     Amos Beimel
Email:       beimel@deas.harvard.edu
Home_page:   http://www.deas.harvard.edu/~beimel
Affiliation: Harvard University
Address:     Division of Engineering & Applied Sciences
Harvard University
40 Oxford st., Cambridge, MA 02138.
Thanks_to:   Supported by grants ONR-M00014-96-1-0550 and
ARO-DAAL-03-92-G0115

Author2:     Felix Geller
Email:       felix@cs.technion.ac.il
Affiliation: Technion
Technion, Haifa 32000,  Israel

Author3:     Eyal Kushilevitz
Email:       eyalk@cs.technion.ac.il
Home_page:   http://www.cs.technion.ac.il/~eyalk
Affiliation: Technion
Technion, Haifa 32000,  Israel
Thanks_to:   Supported by Technion V.P.R. Fund 120-872, by Japan
Technion Society Research Fund, by the fund for the
promotion of research at the Technion, and by the
German-Israeli Foundation for scientific research and
development (GIF).

Abstract:

In this paper we study the query complexity of finding local minimum
points of a boolean function. This task occurs frequently in the exact
learning of many natural classes, such as monotone DNF, O( log n )-term
DNF, unate DNF, and decision trees. On the
negative side, we prove that any (possibly randomized) algorithm that
produces a local minimum of a function f chosen from a sufficiently
``rich'' concept class, using a membership oracle for f, must ask
\Omega(n^2) membership queries in the worst case. In particular, this
lower bound applies to the class of decision trees. A simple algorithm
is known that achieves this lower bound. On the positive side, we show
that for the class O( log n )-term DNF finding local minimum points
requires only \Theta (n log n) membership queries (and more generally
\Theta( nt ) membership queries for t-term DNF with t < n). This
efficient procedure improves the time and query complexity of known
learning algorithms for the class O( log n )-term DNF.
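The naive descent procedure that the \Omega(n^2) lower bound addresses can be sketched directly (our own sketch, not the paper's improved algorithm for O( log n )-term DNF): starting from a positive point, greedily switch off bits as long as the function stays positive, using one membership query per attempted flip.

```python
def local_minimum(f, x):
    """Greedy descent to a local minimum point of f: a positive point
    none of whose proper sub-points (one more bit switched off) is
    positive. `f` is the membership oracle; `x` a 0/1 tuple, f(x) == 1.
    Worst case: O(n^2) membership queries."""
    assert f(x) == 1
    x = list(x)
    while True:
        for i in range(len(x)):
            if x[i]:
                y = x[:i] + [0] + x[i + 1:]
                if f(tuple(y)) == 1:    # membership query
                    x = y
                    break               # restart the scan from the new point
        else:
            return tuple(x)             # no bit can be dropped: local minimum
```

On a monotone DNF such as f = (x0 AND x1) OR x3, descending from the all-ones point reaches a minterm of the target, which is how such points are used in the learning algorithms cited above.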
