COLT 1998 schedule

Friday, July 24, 1998

Joint COLT/ICML/UAI Invited Talk (8:30 - 9:30)
Reinforcement Learning: How Far Can It Go?
Rich Sutton

Session 1 (10:00 - 11:40)
On Sequential Prediction of Individual Sequences Relative to a Set of Experts,
Nicolo Cesa-Bianchi, Gabor Lugosi
Universal Portfolio Selection,
Volodya Vovk, Chris Watkins
Tracking the Best Regressor,
Mark Herbster, Manfred Warmuth
Minimax Relative Loss Analysis for Sequential Prediction Algorithms Using Parametric Hypotheses,
Kenji Yamanishi

Session 2 (1:30 - 3:10)
Robust Learning Aided by Context,
John Case, Sanjay Jain, Matthias Ott, Arun Sharma, Frank Stephan
Birds can fly....,
Jochen Nessel
Learnability of a subclass of extended pattern languages,
Andrew Mitchell
Aspects of Complexity of Conservative Probabilistic Inductive Inference,
Lea Meyer

COLT Invited Talk (3:30 - 4:30)
Learning to Communicate via Unknown Channel,
Meir Feder

Session 3 (4:30 - 5:20)
Improved Boosting Algorithms Using Confidence-rated Predictions,
Robert E. Schapire, Yoram Singer
Combining labeled and unlabeled data with co-training,
Avrim Blum, Tom Mitchell

Joint COLT/ICML/ILP/UAI banquet (7:00 - ), Madison Convention Center
Banquet speaker (9:00 - 9:30): David Spiegelhalter
2.5 Millennia of Directed Graphs

Saturday, July 25, 1998

Joint COLT/ICML/UAI Invited Talk (8:30 - 9:30)
Learning agents for uncertain environments,
Stuart Russell

Session 4 (10:00 - 11:40)
Improved Lower Bounds for Learning from Noisy Examples,
Claudio Gentile, David P. Helmbold
The complexity of learning according to two models of a drifting environment,
Philip M. Long
On the sample complexity of learning functions with bounded variation,
Philip M. Long
Efficient Learning of Monotone Concepts via Quadratic Optimization,
David Gamarnik

Session 5 (1:30 - 3:10)
Structural Results about Exact Learning with Unspecified Attribute Values,
Andreas Birkendorf, Norbert Klasner, Christian Kuhlmann, Hans U. Simon
Learning First Order Universal Horn Expressions,
Roni Khardon
Learning atomic formulas with prescribed properties,
Irene Tsapara, Gyorgy Turan
Exact learning of tree patterns from queries and counterexamples,
Thomas R. Amoth, Paul Cull, Prasad Tadepalli

Session 6 (3:30 - 4:20)
On the power of learning robustly,
Sanjay Jain, Carl Smith, Rolf Wiehagen
Learning 1-Variable Pattern Languages in Linear Average Time,
R\"udiger Reischuk, Thomas Zeugmann

ICML Invited Talk (4:30 - 5:30)
Crossing the Chasm: From Academic Machine Learning to Commercial Data Mining
Ron Kohavi

Joint COLT/ICML/ILP/UAI Poster Session, Madison Convention Center (7:00 - )
"staffed" 8:00-9:30 if first author's last name starts with A-L
"staffed" 9:30-11:00 if first author's last name starts with M-Z

Sunday, July 26, 1998

Joint COLT/ICML/UAI Invited Talk (8:30 - 9:30)
Conditional Independence: A Structural Framework for Uncertainty?
Philip Dawid

Session 7 (10:00 - 11:40)
Large Margin Classification Using the Perceptron Algorithm,
Yoav Freund, Robert E. Schapire
Cross-Validation for Binary Classification by Real-Valued Functions,
Martin Anthony, Sean B. Holden
Some PAC-Bayesian Theorems,
David A. McAllester
Neural networks and efficient associative memory,
M. Miltrup, G. Schnitger

Session 8 (1:30 - 3:10)
Self bounding learning algorithms,
Yoav Freund
Sample Complexity of Model-Based Search,
Christopher D. Rosin
Testing Problems with Sub-Learning Sample Complexity,
Michael Kearns, Dana Ron
Polylogarithmic-Overhead Piecemeal Graph Exploration,
Baruch Awerbuch, Stephen Kobourov

Session 9 (3:30 - 4:20)
Projection Learning,
Leslie G Valiant
The Query Complexity of Finding Local Minima in the Lattice,
Amos Beimel, Felix Geller, Eyal Kushilevitz

joint COLT/ICML/UAI panel discussion (4:30 - 5:30)
Organizers: Thomas Dietterich, David Heckerman, Michael Kearns (Mills Concert Hall, Humanities Bldg)


Time: Friday 10:00-10:25

Title:        On Sequential Prediction of Individual Sequences
              Relative to a Set of Experts
Authors:      Nicolo Cesa-Bianchi and Gabor Lugosi
Affiliation:  University of Milan, Italy,
              Pompeu Fabra University, Spain
         We investigate sequential randomized prediction of an arbitrary
sequence taking values from a binary alphabet. The goal of the predictor
is to minimize his Hamming loss relative to the loss of the best "expert"
in a fixed, possibly infinite, set of experts. We point out a surprisingly
close connection between the prediction problem and empirical process theory.
Using ideas and results from empirical process theory, we show upper and
lower bounds on the minimax relative loss in terms of the geometry of the
class of experts. As a main example, we determine the exact order of
magnitude of the minimax relative loss for the class of Markov experts.
Furthermore, in the special case of static experts, we completely
characterize the minimax relative loss in terms of the maximal deviation
of an associated Rademacher process.
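
The expert-prediction setting above can be made concrete with a standard exponential-weights forecaster for binary sequences; this is a generic sketch of randomized prediction with expert advice, not the paper's minimax construction (the learning rate and the toy expert interface are illustrative assumptions):

```python
import math

def predict_sequence(bits, experts, eta=0.5):
    """Randomized exponential-weights prediction of a binary sequence.

    `bits` is the sequence to predict; each expert maps the history
    (list of past bits) to a prediction in {0, 1}.  Returns the
    predictor's expected Hamming loss and the best expert's loss.
    """
    weights = [1.0] * len(experts)
    expert_losses = [0] * len(experts)
    total_loss = 0.0
    history = []
    for bit in bits:
        preds = [e(history) for e in experts]
        # Predict 1 with probability equal to the weighted vote for 1.
        p_one = sum(w for w, p in zip(weights, preds) if p == 1) / sum(weights)
        # Expected Hamming loss of the randomized prediction.
        total_loss += p_one if bit == 0 else 1.0 - p_one
        # Exponentially downweight experts that erred this round.
        for i, p in enumerate(preds):
            if p != bit:
                weights[i] *= math.exp(-eta)
                expert_losses[i] += 1
        history.append(bit)
    return total_loss, min(expert_losses)
```

Against two constant experts on an all-ones sequence, the forecaster's expected loss stays within a small additive term of the best expert's zero loss, which is the relative-loss quantity the paper's bounds control.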

Time: Friday 10:25-10:50

title:       Universal Portfolio Selection
authors:     V.Vovk and C.Watkins
e-mail:      {vovk,chrisw}
affiliation: Royal Holloway, University of London
address:     Department of Computer Science
             Egham, Surrey TW20 0EX, UK
  A typical problem in portfolio selection in stock markets is that it
  is not clear which of the many available strategies should be used.
  We apply a general algorithm of prediction with expert advice (the
  Aggregating Algorithm) to two different idealizations of the stock
  market. One is the well-known game introduced by Cover in connection
  with his ``universal portfolio'' algorithm; the other is a more
  realistic modification of Cover's game introduced in this paper, where
  the market's participants are allowed to take ``short positions'', so
  that the algorithm may be applied to currency and futures markets.
  Besides applying the Aggregating Algorithm to a countable (or finite)
  family of arbitrary investment strategies, we also apply it, in the
  case of Cover's game, to the uncountable family of ``constant
  rebalanced portfolios'' considered by Cover. We generalize Cover's
  worst-case bounds for his ``universal portfolio'' algorithm (which can
  be regarded as a special case of the Aggregating Algorithm
  corresponding to learning rate 1) to the case of learning rates not
  exceeding 1. We also discuss a general approach to designing
  investment strategies in which, instead of making statistical or other
  assumptions about the market, natural assumptions of computability are
  made about possible investment strategies; this approach leads to
  natural extensions of the notion of Kolmogorov complexity.
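
For Cover's game, wealth-weighted averaging over constant rebalanced portfolios (CRPs) is exactly his universal portfolio when the learning rate is 1. A minimal sketch for two assets over a discretized CRP grid (the grid size and two-asset restriction are simplifying assumptions of this illustration):

```python
def universal_portfolio(price_relatives, grid_size=100):
    """Cover-style universal portfolio for two assets, over a discretized
    grid of constant rebalanced portfolios (CRPs).

    `price_relatives` is a list of (x1, x2) pairs: the factor by which
    each asset's price changes in a period.  Returns the wealth achieved
    by wealth-weighted averaging over the grid of CRPs.
    """
    # b is the fraction held in asset 1; a CRP rebalances back to b daily.
    grid = [i / grid_size for i in range(grid_size + 1)]
    wealth = [1.0] * len(grid)        # wealth of each CRP so far
    total = 1.0
    for x1, x2 in price_relatives:
        # This period's portfolio: wealth-weighted average of the CRPs.
        w_sum = sum(wealth)
        b_t = sum(w * b for w, b in zip(wealth, grid)) / w_sum
        total *= b_t * x1 + (1 - b_t) * x2
        wealth = [w * (b * x1 + (1 - b) * x2) for w, b in zip(wealth, grid)]
    return total
```

By a standard telescoping identity, the achieved wealth equals the average of the individual CRP wealths, so on a market that alternates between favoring each asset it grows exponentially even though buying and holding either asset gains nothing.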

Time: Friday 10:50-11:15

Title:        Tracking the Best Regressor
Authors:      M. Herbster and M. Warmuth
Affiliation:  University of California at Santa Cruz
Address:      Computer Science Department,
              Applied Sciences, 
              Santa Cruz, California 95064 (USA)
Thanks_to:    This research was supported by grant CCR-9700201
              from the National Science Foundation.
In most of the on-line learning research the total on-line loss of the
algorithm is compared to the total loss of the best off-line predictor
$\bu$ from a comparison class of predictors.  We call such bounds
static bounds.  The interesting feature of these bounds is that they
hold for an arbitrary sequence of examples.  Recently some work has
been done where the comparison vector $\bu_t$ at each trial $t$ is
allowed to change with time, and the total on-line loss of the
algorithm is compared to the sum of the losses of $\bu_t$ at each
trial plus the total ``cost'' for shifting to successive comparison
vectors.  This is to model situations in which the examples change
over time and different predictors from the comparison class are best
for different segments of the sequence of examples.  We call such
bounds shifting bounds.  Shifting bounds still hold for arbitrary
sequences of examples and also for arbitrary partitions. The algorithm
does not know the off-line partition or the sequence of predictors
against which its performance is compared.

Naturally shifting bounds are much harder to prove.  The only known
bounds are for the case when the comparison class consists of a finite
set of experts or Boolean disjunctions. In this paper we develop the
methodology for lifting known static bounds to the shifting case.  In
particular we obtain bounds when the comparison class consists of
linear neurons (linear combinations of experts).  Our essential
technique consists of the following. At the end of each trial we {\em
project} the hypothesis of the static algorithm into a suitably chosen
convex region.  This keeps the hypothesis of the algorithm
well-behaved and the static bounds can be converted to shifting bounds
so that the cost for shifting remains reasonable.
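
The projection step can be illustrated for weight vectors over a finite comparison class: a natural convex region is the simplex with every component bounded below by a floor, and the relative-entropy projection onto it amounts to capping the too-small components and rescaling the rest (the floor value is an illustrative assumption; the paper handles general convex regions and linear neurons):

```python
def project_to_floor(w, floor):
    """Project a probability vector w onto {v : v_i >= floor, sum_i v_i = 1}.

    Under the relative-entropy divergence this projection fixes the
    too-small components at the floor and scales the remaining ones
    proportionally; at most len(w) passes are needed.
    """
    n = len(w)
    assert n * floor <= 1.0 and abs(sum(w) - 1.0) < 1e-9
    clipped = [False] * n
    while True:
        free_mass = 1.0 - floor * sum(clipped)
        free_sum = sum(x for x, c in zip(w, clipped) if not c)
        scale = free_mass / free_sum
        changed = False
        for i in range(n):
            if not clipped[i] and w[i] * scale < floor:
                clipped[i] = True
                changed = True
        if not changed:
            return [floor if c else x * scale for x, c in zip(w, clipped)]
```

Keeping every weight above the floor means a weight that must grow back after a shift starts from at least the floor rather than from an arbitrarily small value, which is what keeps the cost for shifting reasonable.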

Time: Friday 11:15-11:40

Title:   Minimax Relative Loss Analysis for Sequential 
         Prediction Algorithms Using Parametric Hypotheses   
Author:  Kenji Yamanishi
Affiliation: Theory NEC Laboratory, Real World Computing Partnership,
         C&C Media Research Laboratories, NEC Corporation,
Address: 1-1, 4-chome, Miyazaki, Miyamae-ku, Kawasaki, 
         Kanagawa 216, Japan.
  We consider the problem of sequential prediction using a parametric
hypothesis class of real-valued functions, e.g., linear predictors,  
piecewise polynomials, etc. We are specifically concerned with 
evaluating the minimax relative cumulative loss (RCL), which is 
defined as inf_A sup_D R(A:D).  Here R(A:D) is the difference 
between the cumulative loss for an algorithm A w.r.t. a data 
sequence D and the minimum cumulative loss over the class w.r.t. D. 
   The first contribution of this paper is to derive a non-asymptotic 
upper bound on the minimax RCL in a general decision-theoretic setting.
It is given by (k/2B)\ln m + C, where k is the number of parameters, 
m is the sample size, B and C do not depend on m. We explicitly identify 
B and C by relating them to the hypothesis class and the loss function 
in a general form. To derive this bound, we propose an algorithm AG2, 
of which the key idea is to discretize the continuous parameter space 
with optimal size and then to apply Vovk's aggregating technique.
  The second contribution of this paper is to derive tight lower bounds 
on the minimax RCL. We thereby show that AG2 is optimal in the sense
that its upper bound on RCL matches the lower bound within o(\log m). 
  The third contribution of this paper is to propose a distributed 
variant of AG2, denoted by DAG, by making the learning process parallel. 
We prove that under certain constraints for the loss function and 
the hypothesis class, DAG suffers the same predictive performance as 
AG2 while achieving a significant speed-up of learning over it.  
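
The discretize-then-aggregate idea behind AG2 can be sketched for square loss with constant predictors on a parameter grid. Note the weighted-mean prediction below is a simplification standing in for Vovk's substitution function, and the grid, learning rate, and expert family are all illustrative assumptions:

```python
import math

def aggregate_predict(ys, grid=None, eta=2.0):
    """Sequential prediction by aggregating a discretized parameter space.

    Experts are the constant predictors c on a grid in [0, 1]; each
    round we predict with the weight-averaged parameter, suffer square
    loss, and reweight experts by exp(-eta * loss).  Returns the
    cumulative loss of the aggregating predictor.
    """
    if grid is None:
        grid = [i / 20 for i in range(21)]
    weights = [1.0] * len(grid)
    total = 0.0
    for y in ys:
        pred = sum(w * c for w, c in zip(weights, grid)) / sum(weights)
        total += (pred - y) ** 2
        # Aggregating update: exponential in each expert's loss.
        weights = [w * math.exp(-eta * (c - y) ** 2)
                   for w, c in zip(weights, grid)]
    return total
```

On a constant target the cumulative loss stays bounded while any fixed wrong parameter suffers loss linear in the sequence length; the logarithmic regret in m is the phenomenon the minimax RCL bounds quantify.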

Time: Friday 1:30-1:55

Title:        Robust Learning Aided by Context
Authors:      1) John Case 
              2) Sanjay Jain
              3) Matthias Ott 
              4) Arun Sharma
              5) Frank Stephan
Email:        1) 
Home_page:    1)
Affiliation:  1) University of Delaware 
              2) National University of Singapore 
              3) Universität Karlsruhe 
              4) University of New South Wales 
              5) Universität Heidelberg
Address:      1) Department of CIS 
                 Newark, DE 19716, USA 
              2) Department of Information Systems and Computer Science 
                 Singapore 119260, Republic of Singapore
              3) Institut für Logik, Komplexität 
                     und Deduktionssysteme
                 76128 Karlsruhe, Germany 
              4) School of Computer Science and Engineering 
                 Sydney 2052, Australia 
              5) Mathematisches Institut
                 Im Neuenheimer Feld 294
                 69120 Heidelberg, Germany
Thanks_to:    This work was carried out while J. Case, S. Jain, M. Ott, 
              and F. Stephan were visiting the School of Computer Science 
              and Engineering at the University of New South Wales  
              3) Supported by the Deutsche Forschungsgemeinschaft (DFG)
                 Graduiertenkolleg ``Beherrschbarkeit komplexer Systeme''
                 (GRK 209/2-96) 
              4) Supported by Australian Research Council Grant A49600456
              5) Supported by the Deutsche Forschungsgemeinschaft (DFG)
                 Grant Am 60/9-2
           Empirical studies of multitask learning provide some
evidence that the performance of a learning system on its intended
targets improves by presenting to the learning system related tasks,
also called contexts, as additional input. Angluin, Gasarch, and
Smith, as well as Kinber, Smith, Velauthapillai, and Wiehagen 
have provided mathematical justification for this phenomenon in the
inductive inference framework.  However, their proofs rely heavily on
self-referential coding tricks, that is, they directly code the
solution of the learning problem into the context. Fulk has shown that
for the EX- and BC-anomaly hierarchies, such results, which rely
on self-referential coding tricks, may not hold robustly. In this work
we analyze robust versions of learning aided by context and show that
--- in contrast to Fulk's result above --- the robust versions
of these learning notions are still very powerful. Also studied is
the difficulty of the functional dependence between the intended
target tasks and useful associated contexts. 

Time: Friday 1:55-2:20

Title:          Birds can fly...
Authors:        Jochen Nessel
Affiliation:    University of Kaiserslautern
Address:        FB Informatik
                Postfach 3049
                67653 Kaiserslautern
  ...with a few exceptions. But in real life almost nobody pays much
attention to these exceptions. What we really are interested in is a 
practicable rule. In Inductive Inference some measures of complexity are 
known, in which the functions of finite support are difficult to learn. 
But this class consists only of functions which are ``the constant zero 
function with a few exceptions.''
  The following presents a uniform approach to modify three of these 
measures in a way that the functions of finite support become easy to 
learn, and tries to give intuitive support for the modified models. The
modification is in essence a formal distinction between rules and 
exceptions and is based on the assumption that the complexity of a 
learning problem should be determined by the difficulty of learning 
the rules, and not by the difficulty of finding the exceptions.

Time: Friday 2:20-2:45

Title:		Learnability of a subclass of extended pattern languages
Authors:	Andrew R. Mitchell
Affiliation:	University of New South Wales
Address:	Department of Artificial Intelligence,
		School of Computer Science and Engineering,
		University of New South Wales,
		Sydney 2052, Australia
Thanks_to:	This research was supported by an Australian Postgraduate Award 
		and the Australian Research Council Grant A49600456 to A. Sharma
	Angluin introduced the class of pattern languages as term languages 
over strings on a finite alphabet, and showed that the class  
is identifiable in the limit from texts (positive data). In 
Angluin's definition of pattern languages, erasing substitutions are
disallowed. Extended pattern languages are term languages over 
strings when erasing substitutions are allowed.  The question of 
identifiability in the limit of extended pattern languages from texts 
has not been resolved, and is one of the outstanding open 
problems in inductive inference.
  However, there are at least two subclasses that are 
known to be identifiable.  Shinohara has shown that the class of 
extended pattern languages that can be generated by regular patterns 
(these are patterns in which each variable occurs no more than once) 
is identifiable in the limit from texts.  Wright showed that the class 
of extended pattern languages where the pattern contains at most m distinct 
variables, for any positive integer m, is identifiable in the limit from texts.

The present paper considers a generalisation of regular patterns where
each variable occurring in the pattern occurs exactly $m$ times for 
some positive integer $m$.  Such patterns are referred to as 
quasi-regular patterns, and  for each m, the corresponding 
class of extended pattern languages is denoted QRP_m.  It is shown 
that for each m, the class QRP_m is identifiable 
in the limit from texts.  The proof employs a sophisticated combinatorial 
argument for bounding the search space. It is also shown that the class 
of the union of all QRP_m is identifiable in the limit from texts.  
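
For intuition about the objects involved: a string belongs to the extended pattern language of a pattern iff some substitution of strings for variables, with erasing allowed, yields exactly that string. A brute-force membership test (the uppercase-variable convention and the exhaustive search are assumptions of this sketch; membership is NP-hard in general, and the paper's identification problem is harder still):

```python
def generates(pattern, s, erasing=True, bindings=None):
    """Decide whether string s is in the pattern language of `pattern`.

    Lowercase letters are constants, uppercase letters are variables;
    a variable binds to one substring (possibly empty when `erasing`)
    used consistently at all of its occurrences.  Brute-force
    backtracking, exponential in the worst case.
    """
    if bindings is None:
        bindings = {}
    if not pattern:
        return not s
    c = pattern[0]
    if c.islower():                  # constant symbol must match exactly
        return bool(s) and s[0] == c and generates(pattern[1:], s[1:],
                                                   erasing, bindings)
    if c in bindings:                # variable already bound: reuse it
        v = bindings[c]
        return s.startswith(v) and generates(pattern[1:], s[len(v):],
                                             erasing, bindings)
    start = 0 if erasing else 1      # erasing allows the empty binding
    for k in range(start, len(s) + 1):
        bindings[c] = s[:k]
        if generates(pattern[1:], s[k:], erasing, bindings):
            return True
        del bindings[c]
    return False
```

Setting `erasing=False` recovers Angluin's original (non-erasing) pattern languages.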

Time: Friday 2:45-3:10

Title:        Aspects of Complexity of Conservative Probabilistic Learning
Authors:      Lea Meyer
Home_page:    http://www.iig.uni-
Affiliation:  Albert-Ludwigs-Universität Freiburg
Address:      Institut für Informatik und Gesellschaft
		Friedrichstr. 50
		79098 Freiburg
When dealing with probabilistic learning of indexed families under
monotonicity constraints, we obtain highly structured probabilistic
hierarchies. In particular, conservative probabilistic learning is more
powerful than conservative deterministic learning even if the probability
is close to 1, provided the learning machines are required to work with
respect to proper or class-preserving hypothesis spaces.
The additional learning power of probabilistic learning machines can be
compensated when we allow the machines to ask ``questions'' to an oracle.
In this paper, we show that conservative probabilistic learning is less
powerful than conservative learning with a K-oracle provided the probability
p is greater than 1/2. However, K is necessary, i.e., for each p>1/2 there
exists a learning problem which is conservatively learnable by a K-oracle
machine but not by an oracle machine having access to an oracle A which has
a lower Turing degree than K.
The main result is that for each oracle A weaker than the halting problem
K, there exists an indexed family L(A) which is conservatively identifiable
with p=1/2, and which exactly reflects the Turing degree of A, i.e., L(A)
is conservatively identifiable by an oracle machine M[B] iff the Turing
degree of A is less than or equal to the Turing degree of B. However, not
every indexed family which is conservatively identifiable with probability
p=1/2 reflects the Turing degree of an oracle. Hence, conservative
probabilistic learning with p=1/2 is too ``rich'' to be measured in terms
of Turing complexity.

Time: Friday 3:30-4:30

Title:        Learning to Communicate via Unknown Channel
Authors:      Meir Feder
Affiliation:  Electrical Eng. dept., Tel-Aviv University

Time: Friday 4:30-4:55

Title:        Improved Boosting Algorithms Using Confidence-rated Predictions
Authors:      Robert E. Schapire and Yoram Singer
Affiliation:  AT&T Labs
Address:      180 Park Avenue
	      Florham Park, NJ  07932
              We describe several improvements to Freund and
Schapire's AdaBoost boosting algorithm, particularly in a setting in
which hypotheses may assign confidences to each of their predictions.
We give a simplified analysis of AdaBoost in this setting, and we show
how this analysis can be used to find improved parameter settings as
well as a refined criterion for training weak hypotheses.  We give a
specific method for assigning confidences to the predictions of
decision trees, a method closely related to one used by Quinlan.  This
method also suggests a technique for growing decision trees which
turns out to be identical to one proposed by Kearns and Mansour.

We focus next on how to apply the new boosting algorithms to
multiclass classification problems, particularly to the multi-label
case in which each example may belong to more than one class. We give
two boosting methods for this problem.  One of these leads to a new
method for handling the single-label case which is simpler but as
effective as techniques suggested by Freund and Schapire.  Finally, we
give some experimental results comparing a few of the algorithms
discussed in this paper.
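
For weak hypotheses with range [-1, +1], the paper's criterion leads to the weight alpha = (1/2) ln((1+r)/(1-r)), with r the distribution-weighted correlation; a minimal sketch with an exhaustively searched weak-hypothesis pool (the pool, the stopping rule, and the toy stump family used below are illustrative assumptions):

```python
import math

def adaboost_confidence(examples, labels, weak_hyps, rounds):
    """Confidence-rated boosting in the Schapire-Singer style (sketch).

    Each weak hypothesis maps an example to a real value in [-1, 1]:
    the sign is its prediction, the magnitude its confidence.  Each
    round picks the hypothesis maximizing |r|, r = sum_i D_i y_i h(x_i),
    and weights it by alpha = (1/2) ln((1 + r) / (1 - r)).
    """
    m = len(examples)
    D = [1.0 / m] * m                 # distribution over examples
    ensemble = []                     # list of (alpha, h) pairs
    for _ in range(rounds):
        best_h, best_r = None, 0.0
        for h in weak_hyps:
            r = sum(d * y * h(x) for d, x, y in zip(D, examples, labels))
            if abs(r) > abs(best_r):
                best_h, best_r = h, r
        if best_h is None or abs(best_r) >= 1.0:
            break                     # no useful (or a perfect) hypothesis
        alpha = 0.5 * math.log((1.0 + best_r) / (1.0 - best_r))
        ensemble.append((alpha, best_h))
        # Reweight: examples with small margin this round gain weight.
        D = [d * math.exp(-alpha * y * best_h(x))
             for d, x, y in zip(D, examples, labels)]
        z = sum(D)
        D = [d / z for d in D]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```

A negative r simply flips the hypothesis via a negative alpha, so "anti-correlated" weak hypotheses are as useful as correlated ones.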

Time: Friday 4:55-5:20

Title:        Combining Labeled and Unlabeled Data with Co-Training
Authors:      Avrim Blum and Tom Mitchell
Affiliation:  Carnegie Mellon University
Address:      School of Computer Science,
              Carnegie Mellon University,
              Pittsburgh PA 15213-3891
Thanks_to:    This research was supported in part by the DARPA HPKB
              program under contract F30602-97-1-0215 and by NSF National 
              Young Investigator grant CCR-9357793

  We consider the problem of using a large unlabeled sample to boost
performance of a learning algorithm when only a small set of labeled
examples is available.  In particular, we consider a problem setting
motivated by the task of learning to classify web pages, in which the
description of each example can be partitioned into two distinct
views.  For example, the description of a web page can be partitioned
into the words occurring on that page, and the words occurring in
hyperlinks that point to that page.  We assume that either view of the
example would be sufficient for learning if we had enough labeled
data, but our goal is to use both views together to allow inexpensive
unlabeled data to augment a much smaller set of labeled
examples.  Specifically, the presence of two distinct views of each
example suggests strategies in which two learning algorithms are
trained separately on each view, and then each algorithm's predictions
on new unlabeled examples are used to enlarge the training set of the
other.  Our goal in this paper is to provide a PAC-style analysis for
this setting, and, more broadly, a PAC-style framework for the general
problem of learning from both labeled and unlabeled data.  We also
provide empirical results on real web-page data indicating that this
use of unlabeled examples can lead to significant improvement of
hypotheses in practice.
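
The co-training loop itself is simple to sketch. Everything concrete below (the per-view majority-vote classifier, the confidence score, and the one-example-per-view labeling schedule) is a toy assumption for illustration, not the paper's setting:

```python
from collections import defaultdict

def majority_rule(pairs):
    """Train a trivial one-view classifier: for each feature value seen,
    predict the majority label; confidence is the majority fraction."""
    votes = defaultdict(list)
    for v, y in pairs:
        votes[v].append(y)
    table = {}
    for v, ys in votes.items():
        label = max(set(ys), key=ys.count)
        table[v] = (label, ys.count(label) / len(ys))
    return lambda v: table.get(v, (0, 0.0))   # unseen value: no confidence

def co_train(labeled, unlabeled, rounds=5):
    """Blum-Mitchell-style co-training loop (toy sketch).

    `labeled` holds ((view1, view2), label) pairs; `unlabeled` holds
    (view1, view2) pairs.  Each round trains one classifier per view on
    the current pool, then each classifier confidently labels one
    unlabeled example, growing the training set of the other view.
    """
    pool = list(labeled)
    remaining = list(unlabeled)
    for _ in range(rounds):
        if not remaining:
            break
        classifiers = [majority_rule([(x[i], y) for x, y in pool])
                       for i in (0, 1)]
        for view, clf in enumerate(classifiers):
            if not remaining:
                break
            best = max(remaining, key=lambda x: clf(x[view])[1])
            label, conf = clf(best[view])
            if conf > 0:          # only add confidently labeled examples
                pool.append((best, label))
                remaining.remove(best)
    return pool
```

The point of the loop is visible even in this toy: an example that is unfamiliar to one view can still be labeled through the other view, and the label then teaches the first view a new feature value.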

Time: Saturday 8:30-9:30

Title:        Learning agents for uncertain environments
Authors:      Stuart Russell
Affiliation:  Computer Science Dept., University of California, Berkeley.

Time: Saturday 10:00-10:25

Title:       Improved Lower Bounds for Learning from Noisy Examples:
             an Information-Theoretic Approach
Authors:     Claudio Gentile, David P. Helmbold
Affiliation: Università di Milano, 
	     University of California Santa Cruz
Address:     DSI, Via Comelico 39, 20135 Milano, Italy,
	     Computer Science Department, 95064 Santa Cruz, USA
Thanks_to:   The second author is supported by NSF grant CCR 9700201

This paper presents 
a general information-theoretic approach for obtaining lower bounds
on the number of examples needed to PAC learn in the presence 
of noise. 
This approach deals directly with the fundamental information 
quantities, avoiding a Bayesian analysis.
The technique is applied to several different models, 
illustrating its generality and power.
The resulting bounds add logarithmic factors to (or improve 
the constants in) previously known lower bounds.          

Time: Saturday 10:25-10:50

Title: The complexity of learning according to two models of a drifting environment
Authors: P. M. Long
Affiliation: National University of Singapore
Address: ISCS Department
         National University of Singapore
         Singapore 119260, Republic of Singapore
We establish new sufficient conditions for learning in a drifting environment
for both the agnostic and the realizable cases.  Both match known necessary
conditions to within constant factors.  We provide a simpler proof of a 
bound on the sample complexity of agnostic learning in a fixed environment
that is within a constant factor of the best possible.

Time: Saturday 10:50-11:15

Title:        On the sample complexity of learning functions with bounded variation
Authors:      P. M. Long
Affiliation:  National University of Singapore
Address:      ISCS Department
              National University of Singapore
              Singapore 119260, Republic of Singapore
Thanks_to:    This research was supported by National University of Singapore Academic Research Fund Grant RP960625.
We show that the class BV of [0,1]-valued functions with total
variation at most 1 can be agnostically learned with respect to the
absolute loss in polynomial time from O((1/epsilon)^2 log (1/delta))
examples, matching a known lower bound to within a constant factor.
We establish a bound of O(1/m) on the expected error of a
polynomial-time algorithm for learning BV in the prediction model,
also matching a known lower bound to within a constant factor.
Applying a known algorithm transformation to our prediction algorithm,
we obtain a polynomial-time PAC learning algorithm for BV with a
sample complexity bound of O((1/epsilon)log(1/delta)); this also
matches a known lower bound to within a constant factor.

Time: Saturday 11:15-11:40

Title: 		Efficient Learning of Monotone Concepts via Quadratic Optimization
Authors:  	David Gamarnik
Affiliation:  IBM, T. J. Watson Research Center
Address:      IBM, T. J. Watson Research Center
				P.O. Box 218,
				Yorktown Heights, NY 10598

Time: Saturday 1:30-1:55

Title:		Structural Results about Exact Learning with
		Unspecified Attribute Values
Authors:	Andreas Birkendorf, Norbert Klasner,
		Christian Kuhlmann, and Hans U. Simon
Affiliation:	Ruhr-Universitaet Bochum
Address:        Lehrstuhl Mathematik und Informatik,
		Fakultaet fuer Mathematik,
		D-44780 Bochum,
Thanks_to:	The authors gratefully acknowledge the support of
		Deutsche Forschungsgemeinschaft grant Si 498/3-2 and
		the support of the German-Israeli Foundation for
		Scientific Research and Development grant
	This paper deals with the UAV learning model of Goldman,
Kwek and Scott [GoldmanKS97], where ``UAV'' is the acronym for
``Unspecified Attribute Values''.  As in [GoldmanKS97], we consider
exact learning within the UAV framework.  A smooth transition between
exact learning in the UAV setting and standard exact learning is
obtained by putting a fixed bound r on the number of unspecified
attribute values per instance.  For r=0, we obtain the standard
model.  For r=n (the total number of attributes), we obtain the
(unrestricted) UAV model.  Between these extremes, we find the
hierarchies (UAV-MQ_r)_{0 <= r <= n}, (UAV-EQ_r)_{0 <= r <= n}, and
(UAV-ARB-EQ_r)_{0 <= r <= n}.

	Our main results are as follows.  We present various lower
bounds on the number of ARB-EQs and UAV-MQs in terms of the
Vapnik-Chervonenkis dimension of the concept class.  We show furthermore that
a natural extension of Angluin's Sunflower Lemma [Angluin88] is still
applicable in the exact UAV learning model.  Our UAV Sunflower Lemma
allows us to establish exponentially large lower bounds on the necessary
number of UAV-MQs for several popular concept classes.  On the other
hand, we can show that slight simplifications of these classes are
efficiently learnable using only a few UAV-MQs.  Finally, we investigate
the inherent structure of the aforementioned three hierarchies and the
relations between them.  It turns out that query type UAV-EQ_{r-1} is
strictly stronger than UAV-EQ_r (for each constant r).  The analogous
result for UAV-ARB-EQ is valid.  Furthermore, UAV-MQ_{r+omega(log n)}
is strictly stronger than UAV-MQ_r.  We also determine the relation
between query types chosen from different hierarchies.

Time: Saturday 1:55-2:20

Title:        Learning First Order Universal Horn Expressions
Authors:      R. Khardon 

Affiliation:  University of Edinburgh
Address:      Department of Computer Science
              University of Edinburgh 
              The King's Buildings 
              Edinburgh EH9 3JZ 
Thanks_to:    Part of this work was done while the author was at 
              Harvard University and supported by ARO grant 
              DAAL03-92-G-0115 and ONR grant N00014-95-1-0550.

	The problem of learning universally quantified function free
first order Horn expressions from equivalence and membership queries
is studied.  The results presented generalise the known algorithm for
propositional Horn expressions for several models of learning in first
order logic.  It is shown that exact learning is possible with
membership and equivalence queries with resources polynomial in the
number of clauses in the expression, though superpolynomial in the
number of universally quantified variables.  Similar results for
related models including entailment queries and ILP are also derived.

Time: Saturday 2:20-2:45

Title:      Learning atomic formulas with prescribed properties
Authors:      I.Tsapara and G.Turan
Affiliation:  University of Illinois at Chicago
              (G. Turan is also affiliated with the Artificial Intelligence
              Research Group of the Hungarian Academy of Sciences, Szeged,
              Hungary; I. Tsapara's current affiliation is Investment
              Technologies International, Chicago, IL)
Address:      	Department of MSCS,
              	851 S.Morgan, M/C 249
    	Chicago, IL, 60607-7045, USA
Thanks_to:   Partially supported by grants ESPRIT 20237 and
OTKA T-016349.

         We consider the learnability of some concept classes in predicate 
logic with proper equivalence queries. Concepts are represented by atomic 
formulas over a restricted language. The concept represented by an atomic 
formula consists of its ground instances having bounded depth.
In addition, it is assumed that there is a first-order sentence given, and 
the concept class contains only those atomic formulas which satisfy this sentence. 
For instance, one may consider the learnability of atomic formulas that 
represent symmetric, or transitive concepts.
It is shown that every such concept class can be learned with O(m^2 + m log 
n) queries, where m is the quantifier rank of the sentence and n is the 
depth bound for the ground instances. The proof uses a combination of tools 
from logic and learning algorithms for geometric concepts. Model theoretic 
games are used to determine the structure of the concept classes. We 
formulate a constrained version of the problem of learning two-dimensional 
axis-parallel rectangles,
where one corner is required to belong to a prespecified subset. A 
sufficient condition is given for the efficient learnability of a rectangle 
in terms of the geometric properties of this subset.
The algorithm for the predicate logic learning problem is obtained by 
combining the learning algorithm for the constrained rectangle learning 
problem with information given by the logical analysis of the structure of 
the concept classes.

Time: Saturday 2:45-3:10

Title:        Exact Learning of Tree Patterns from Queries and Counterexamples
Authors:      Thomas R. Amoth and Paul Cull and Prasad Tadepalli
Email:        {amotht,pc,tadepall}
Affiliation:  Oregon State University
Address:      Department of Computer Science
              Oregon State University
              Corvallis, OR 97331
Thanks_to:    This research was partially supported by NSF under grant
	      number IRI-9520243.
We consider learning tree patterns from queries. The instances
are ordered and unordered trees with nodes labeled by constant identifiers.
The concepts are tree patterns and unions of tree patterns (forests) 
where all the internal nodes are labeled
with constants and the leaves are labeled with constants or variables. A
tree pattern matches any tree with its variables replaced
with constant subtrees. We show that ordered trees, in which the children 
are matched in a strict left-to-right order, are exactly learnable from 
equivalence queries, while ordered forests are learnable from equivalence
and membership queries. Unordered trees are exactly learnable from superset 
queries, and unordered forests are learnable from superset and 
equivalence queries. On the negative side, we also show that each of the 
query types used is necessary for learning the corresponding concept class. 
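
To make the matching semantics concrete, here is a small Python sketch (an illustrative representation, not the authors' code): a tree is a `(label, children)` pair, and pattern leaves whose label starts with `?` are variables that bind to whole subtrees.

```python
# Ordered tree-pattern matching: a tree is (label, children); pattern
# leaves whose label starts with '?' are variables bound to subtrees.

def match(pattern, tree, bindings=None):
    """Return a variable binding if `pattern` matches `tree`, else None."""
    if bindings is None:
        bindings = {}
    label, children = pattern
    if label.startswith("?") and not children:       # variable leaf
        if label in bindings:                        # repeated variable must
            return bindings if bindings[label] == tree else None
        bindings[label] = tree                       # bind to the subtree
        return bindings
    tlabel, tchildren = tree
    if label != tlabel or len(children) != len(tchildren):
        return None
    for p, t in zip(children, tchildren):            # strict left-to-right
        if match(p, t, bindings) is None:
            return None
    return bindings

# The pattern f(?x, g(?x)) matches f(a, g(a)) with ?x bound to a.
leaf = lambda s: (s, [])
pat  = ("f", [leaf("?x"), ("g", [leaf("?x")])])
tree = ("f", [leaf("a"), ("g", [leaf("a")])])
```

With this representation, an equivalence or membership query would be posed about the language of such a pattern; the sketch shows only the matching step.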

Time: Saturday 3:30-3:55

Title:       On the Power of Learning Robustly
Authors:     Sanjay Jain, Carl Smith and Rolf Wiehagen
Affiliation: (1) National University of Singapore,
             (2) University of Maryland,
             (3) University of Kaiserslautern
Address:     (1) Department of ISCS,
             National University of Singapore,
             Singapore 119260
             (2) Department of Computer Science,
             University of Maryland,
             College Park, MD 20742,
             (3) Department of Computer Science,
             University of Kaiserslautern,
             D-67653 Kaiserslautern,
        Intuitively, a class of objects is robustly learnable if not only 
the class itself is learnable, but all of its computable transformations 
remain learnable as well. In this sense, robust learnability seems to be a 
desirable property in all fields of learning.
We will study this phenomenon within the paradigm of inductive inference.
Here a class of recursive functions is called robustly learnable under a
success criterion I iff all of its images under general recursive 
operators are learnable under the criterion I. Fulk showed the existence 
of a non-trivial class which is robustly learnable under the criterion 
Ex. However, several of the hierarchies (such as the anomaly hierarchies 
for Ex and Bc) do not stand robustly. Hence, until now it was not 
clear whether robust learning is really rich. The main intention of this 
paper is to give strong evidence that robust learning is rich.
Our main result, proved by a priority construction, is that the mind 
change hierarchy for Ex stands robustly. Moreover, the hierarchies 
of team learning for both Ex and Bc stand robustly as well.
In several contexts, we observe the surprising fact that a more 
complex topological structure of the classes to be learned leads 
to positive robustness results, whereas an easy topological structure 
yields negative results. We also show the counter-intuitive fact 
that even some self-referential classes can be learned robustly. 
Some of our results point out the difficulty of robust learning when only
a bounded number of mind changes is allowed.
Further results concerning uniformly robust learning are summarized.

Time: Saturday 3:55-4:20

Title:        Learning One-Variable Pattern Languages in Linear Average Time
Authors:      R. Reischuk and T. Zeugmann
Affiliation:  Medical University Luebeck
              Kyushu University
Address:      Institut fuer Theoretische Informatik
              Medizinische Universitaet Luebeck
              Wallstr. 40
              23560 Luebeck, Germany
              Department of Informatics
              Kyushu University
              Kasuga 816-8580, Japan

Thanks_to:    This research has been supported by the Japan Society for 
              the Promotion of Science under Grant JSPS 29716102

           A new algorithm for learning one-variable pattern languages is
proposed and analyzed with respect to its average-case behavior.
We consider the total learning time that takes into account all
operations till an algorithm has converged to a correct hypothesis.
For the expectation it is shown that, for almost all meaningful 
distributions defining how the pattern variable is replaced by a string to 
generate random examples of the target pattern language, this algorithm 
converges within a constant number of rounds, with a total learning time 
that is linear in the pattern length.
Thus, the algorithm is average-case optimal in a strong sense.

Though one-variable pattern languages cannot be inferred finitely, our 
approach can also be considered as probabilistic finite learning with high 
confidence.
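
To illustrate the objects involved (a naive check, not the authors' linear average-time learner), here is a Python membership test for a one-variable pattern, where every occurrence of the variable `x` must be replaced by the same nonempty string:

```python
# Does string s belong to the language of one-variable pattern p?
# 'x' marks the variable; all occurrences get the same nonempty substitution.

def generates(pattern, s):
    k = pattern.count("x")
    consts = len(pattern) - k
    if k == 0:
        return pattern == s
    extra = len(s) - consts
    if extra <= 0 or extra % k != 0:
        return False
    m = extra // k                      # length of the substitution string
    i, sub = 0, None
    for c in pattern:
        if c == "x":
            piece = s[i:i + m]
            if sub is None:
                sub = piece             # first occurrence fixes the value
            elif piece != sub:
                return False
            i += m
        else:
            if i >= len(s) or s[i] != c:
                return False
            i += 1
    return i == len(s)

# Example: the pattern "0x1x0" with x = "ab" generates "0ab1ab0".
```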

Time: Sunday 10:00-10:25

Title:        Large Margin Classification Using the Perceptron Algorithm
Authors:      Yoav Freund and Robert E. Schapire
Affiliation:  AT&T Labs
Address:      180 Park Avenue
	      Florham Park, NJ  07932
              We introduce and analyze a new algorithm for linear
classification which combines Rosenblatt's perceptron algorithm with
Helmbold and Warmuth's leave-one-out method.  Like Vapnik's
maximal-margin classifier, our algorithm takes advantage of data that
are linearly separable with large margins.  Compared to Vapnik's
algorithm, however, ours is much simpler to implement, and much more
efficient in terms of computation time.  We also show that our
algorithm can be efficiently used in very high dimensional spaces
using kernel functions.  We performed some experiments using our
algorithm, and some variants of it, for classifying images of
handwritten digits.  The performance of our algorithm is close to, but
not as good as, the performance of maximal-margin classifiers on the
same problem.
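
A minimal sketch of the algorithm's core idea in Python (the kernelized, leave-one-out machinery of the paper is omitted): every intermediate perceptron weight vector is stored and votes with weight equal to the number of examples it survived before its next mistake.

```python
# Voted-perceptron sketch: retired weight vectors vote in proportion
# to how long they survived without a mistake.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_voted_perceptron(examples, d, epochs=10):
    """examples: list of (x, y) with x a length-d vector and y in {-1, +1}."""
    w, c = [0.0] * d, 0
    vectors, counts = [], []
    for _ in range(epochs):
        for x, y in examples:
            if y * dot(w, x) <= 0:                  # mistake: retire w
                vectors.append(w)
                counts.append(c)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                c = 1
            else:
                c += 1
    vectors.append(w)
    counts.append(c)
    return vectors, counts

def predict(vectors, counts, x):
    # Weighted majority vote over all stored prediction vectors.
    sign = lambda t: (t > 0) - (t < 0)
    s = sum(c * sign(dot(v, x)) for v, c in zip(vectors, counts))
    return 1 if s >= 0 else -1
```

Using only the final vector recovers the plain perceptron; the vote over intermediate vectors is what connects the method to the leave-one-out analysis mentioned above.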

Time: Sunday 10:25-10:50

Title:        Cross-Validation for Binary Classification by Real-Valued 
              Functions: Theoretical Analysis
Authors:      Martin Anthony and Sean B. Holden
Affiliation:  London School of Economics
Address:      Department of Mathematics,
              London School of Economics,
              Houghton Street, 
              London WC2A 2AE, U.K.
Affiliation:  University College London
Address:      Department of Computer Science,
              University College London,
              Gower Street,
              London WC1E 6BT, U.K.
Thanks_to:    Some of this research was carried out while the authors were 
              guests of the Isaac Newton Institute for Mathematical Sciences, 
              Cambridge University.
This paper concerns the use of real-valued functions for binary
classification problems. Previous work in this area has concentrated
on using as an error estimate the `resubstitution' error (that is, the
empirical error of a classifier on the training sample) or its
derivatives. However, in practice, cross-validation and related
techniques are more popular. Here, we analyse theoretically the
accuracy of the holdout and cross-validation estimators for the case
where real-valued functions are used as classifiers. We then introduce
two new error estimation techniques, which we call the {\em adaptive
holdout estimate\/} and the {\em adaptive cross-validation
estimate\/}, and we perform a similar analysis for these. Finally, we
show how our results can be applied to certain types of neural networks.
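
For concreteness, the simplest estimator analysed here can be sketched as follows (thresholding at 1/2 is an illustrative choice; the paper treats general real-valued classifiers): count thresholded mistakes on the held-out part of the sample.

```python
# Holdout error estimate for a real-valued classifier f: threshold the
# output at 1/2 and count mistakes on held-out (x, label) pairs,
# with labels in {0, 1}.

def holdout_error(f, holdout):
    mistakes = sum((f(x) >= 0.5) != bool(label) for x, label in holdout)
    return mistakes / len(holdout)
```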

Time: Sunday 10:50-11:15

Title:        Some PAC-Bayesian theorems
Authors:      David McAllester
Affiliation:  AT&T Labs-Research
Address:      P.O. Box 971
	      180 Park Avenue
              Florham Park, NJ 07932-0971
This paper gives PAC guarantees for ``Bayesian'' algorithms ---
algorithms that optimize risk minimization expressions involving a
prior probability and a likelihood for the training data.
PAC-Bayesian algorithms are motivated by a desire to provide an
informative prior encoding information about the expected experimental
setting but still having PAC performance guarantees over all IID
settings.  The PAC-Bayesian theorems given here apply to an arbitrary
prior measure on an arbitrary concept space.  These theorems provide
an alternative to the use of VC dimension in proving PAC bounds for
parameterized concepts.

Time: Sunday 11:15-11:40

Title:        Neural Networks and Efficient Associative Memory
Authors:      M. Miltrup and G. Schnitger
Affiliation:  University of Frankfurt
Address:      Computer Science Department,
              Postfach 11 19 32
              60054 Frankfurt, Germany
Thanks_to:    Partially supported by DFG project Schn 503/1-2
Abstract:     We consider the familiarity problem: for given n-bit
patterns x_1,...,x_m and an n-bit query y, one has to decide whether y is
sufficiently close to one of the patterns with respect to Hamming
distance. Errors are allowed, and inputs of intermediate distance are
excluded when evaluating arbitrary memory models for the familiarity
problem.  We design neural circuits for
various distributions and show that the memory complexity of our circuits
(i.e., the logarithm of the total number of required circuits) is close
to optimal, even when compared to arbitrary memory models. Moreover,
we obtain near-optimal results simultaneously for the resources of
memory complexity, depth, and number of edges. Additionally, the number
of vertices can be minimized in a somewhat restricted setting.
Circuits for the familiarity problem are considerably smaller
in terms of memory complexity, number of edges or number of gates
when compared to circuits for the conventional Hamming distance problem.
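
As a reference point (not one of the paper's circuits), the familiarity problem is trivial if one simply stores all m*n pattern bits:

```python
# Brute-force familiarity check: is query y within Hamming distance r
# of some stored pattern?  Memory is m*n bits; the circuits above get
# by with far less.

def hamming(a, b):
    return sum(ca != cb for ca, cb in zip(a, b))

def familiar(patterns, y, r):
    return any(hamming(x, y) <= r for x in patterns)
```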

Time: Sunday 1:30-1:55

Title:        Self bounding learning algorithms
Authors:      Yoav Freund
Affiliation:  AT&T Labs - Research
Address:      AT&T Labs - Research
              180 Park ave, Room A205
              Florham Park, NJ 07932-0971
Most of the work that attempts to give bounds on the generalization
error of the hypothesis generated by a learning algorithm is based on
methods from the theory of uniform convergence. These bounds are
a priori bounds that hold for any distribution of examples and are
calculated before any data is observed. In this paper we propose a
different approach for bounding the generalization error after the data
has been observed. A self-bounding learning algorithm is an algorithm
which, in addition to the hypothesis that it outputs, outputs a
reliable upper bound on the generalization error of this hypothesis.
We first explore the idea in the statistical query learning framework
of Kearns. After that, we give an explicit self-bounding
algorithm for learning algorithms that are based on local search.

Time: Sunday 1:55-2:20

Title:        Sample Complexity of Model-based Search
Authors:      Christopher D. Rosin
Affiliation:  The Scripps Research Institute and UCSD CSE Dept. 
Address:      The Scripps Research Institute, Mail Drop MB5
              10550 North Torrey Pines Road
              La Jolla, CA 92037
Thanks_to:    This work was supported by Burroughs Wellcome LJIS 
              grant APP #0842.  
We consider the problem of searching a domain for points that have a
desired property, in the special case where the objective function
that determines the properties of points is unknown and must be
learned during search.  We give a parallel to PAC learning theory that
is appropriate for reasoning about the sample complexity of this
problem.  The learner queries the true objective function at selected
points, and uses this information to choose models of the objective
function from a given hypothesis class that is known to contain a
correct model.  These models are used to focus the search on more
promising areas of the domain.  The goal is to find a point with the
desired property in a small number of queries.  We define an analog to
VC dimension, needle dimension, to be the size of the largest
sample in which any single point could have the desired property
without the other points' values revealing this information.  We give
an upper bound on sample complexity that is linear in needle dimension
for a natural type of search protocol, and a linear lower bound for a
class of constrained problems.  We describe the relationship between
needle dimension and VC dimension, and give several simple examples.

Time: Sunday 2:20-2:45

Title:        Testing Problems with Sub-Learning Sample Complexity
Authors:      Michael Kearns and Dana Ron
Affiliation:  AT&T Labs - Research,
              MIT Laboratory for Computer Science
Address:      180 Park Avenue Florham Park NJ,
              545 Technology Square Cambridge MA
Thanks_to:    The second author is supported by an ONR Science Scholar
              Fellowship at the Bunting Institute
We study the problem of determining, for a class of functions $H$, 
whether an unknown target function $f$ is contained in $H$ or is 
``far'' from any function in $H$. Thus, in contrast to problems of 
{\em learning\/}, where we must {\em construct\/} a good approximation 
to $f$ in $H$ on the basis of sample data, in problems of {\em testing\/} 
we are only required to determine the {\em existence\/} of a good 
approximation. Our main results demonstrate that, over the domain 
$[0,1]^d$ for constant $d$, the number of examples required for testing 
grows only as $O(s^{1/2+\delta})$ (where $\delta$ is any small constant), 
for both decision trees of size $s$ and a special class of neural networks 
with $s$ hidden units.
This is in contrast to the $\Omega(s)$ examples required for learning
these same classes. Our tests are based on combinatorial constructions
demonstrating that these classes can be approximated by small classes of
coarse partitions of space, and rely on repeated application of the 
well-known Birthday Paradox.

Time: Sunday 2:45-3:10

Title:        Polylogarithmic-Overhead Piecemeal Graph Exploration
Authors:      Baruch Awerbuch and Stephen G. Kobourov
Affiliation:  Johns Hopkins University
Address:      Department of Computer Science
              3400 N Charles Street 
              Baltimore, MD 21218
Thanks_to:    This research was supported by ARPA/Army contract
DABT63-93-C-0038, DARPA/Airforce contract F30602961029, NSF Award ID
CCR9700157, and  ARO contract \#DAAH04-95-1-0607
    We introduce a new traversal technique in the context of
    piecemeal exploration of unknown graphs. The problem of learning a
    graph via piecemeal exploration requires a robot to create a
    complete map of its environment, subject to two constraints.
    First, it cannot jump between non-adjacent vertices in one time
    step, and second, it must return to a fixed starting point every so
    often. This paper presents the recursive piecemeal search (RPS)
    strategy together with an algorithm for the above problem. We are
    able to achieve O(log^2 n) overhead (where n is the number of
    vertices), improving on previous results of Awerbuch, Betke,
    Rivest, and Singh which require O(n^\epsilon) overhead. The
    graph is discovered via the recursive piecemeal search, which can
    be viewed as a combination of breadth-first and depth-first passes. The
    construction of RPS trees relies on the concept of sparse
    neighborhood covers and captures nicely the nature of the graph
    exploration problem.

Time: Sunday 3:30-3:55

Title: 	       Projection Learning
Author:        Leslie G. Valiant
Affiliation:   Harvard University
Address:       Division of Engineering and Applied Sciences
               40 Oxford Street
               Cambridge, MA  02138

        A method of combining learning algorithms is described that preserves
attribute efficiency.   It yields learning algorithms  that
require a number of examples that is polynomial in the number
of relevant variables and logarithmic in the number of irrelevant
ones.  The algorithms are simple to implement and realizable on
networks with a number of nodes linear in the total number of
variables.  They include several  strict generalizations of
Littlestone's Winnow algorithm, and are, therefore, good candidates
for experimentation on  domains
having very large numbers of attributes but where nonlinear hypotheses
are sought.
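
For readers unfamiliar with the base algorithm being generalized, here is a sketch of Littlestone's Winnow in Python (standard promotion/demotion form; the parameter choices are illustrative):

```python
# Sketch of Littlestone's Winnow: multiplicative weight updates give a
# mistake bound logarithmic in the number of irrelevant attributes.

def winnow(examples, n, alpha=2.0):
    """examples: list of (x, label), x a 0/1 tuple of length n, label in
    {0, 1}.  Returns the weight vector; predict 1 iff sum w_i x_i >= n."""
    w = [1.0] * n
    theta = float(n)                       # fixed threshold
    for x, label in examples:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
        if pred == 1 and label == 0:       # false positive: demote
            w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
        elif pred == 0 and label == 1:     # false negative: promote
            w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
    return w
```

The multiplicative updates are what yield attribute efficiency, the property the combination method above is designed to preserve.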

Time: Sunday 3:55-4:20

Title:       The Query Complexity of Finding Local Minima in the Lattice       

Author1:     Amos Beimel     
Affiliation: Harvard University  
Address:     Division of Engineering & Applied Sciences     
             Harvard University     
             40 Oxford st., Cambridge, MA 02138.     
Thanks_to:   Supported by grants ONR-M00014-96-1-0550 and
Author2:     Felix Geller    
Affiliation: Technion  
Address:     Computer Science Department   
             Technion, Haifa 32000,  Israel  

Author3:     Eyal Kushilevitz     
Affiliation: Technion  
Address:     Computer Science Department   
             Technion, Haifa 32000,  Israel  
Thanks_to:   Supported by Technion V.P.R. Fund 120-872, by Japan
             Technion Society Research Fund, by the fund for the
             promotion of research at the Technion, and by the
             German-Israeli Foundation for scientific research and
             development (GIF). 

In this paper we study the query complexity of finding local minimum
points of a boolean function. This task occurs frequently in the exact
learning of many natural classes, such as monotone DNF, O(log n)-term
DNF, unate DNF, and decision trees. On the
negative side, we prove that any (possibly randomized) algorithm that
produces a local minimum of a function f chosen from a sufficiently
``rich'' concept class, using a membership oracle for f, must ask
\Omega(n^2) membership queries in the worst case. In particular, this
lower bound applies to the class of decision trees. A simple algorithm
is known that achieves this lower bound. On the positive side, we show
that, for the class of O(log n)-term DNF, finding local minimum points
requires only \Theta(n log n) membership queries (and, more generally,
\Theta(nt) membership queries for t-term DNF with t < n). This
efficient procedure improves the time and query complexity of known
learning algorithms for the class of O(log n)-term DNF.
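
The simple procedure matching the \Omega(n^2) lower bound can be sketched as follows (a generic sketch, not the authors' code): repeatedly try to clear single 1-bits of a positive point, keeping a change only if the function stays positive.

```python
# Walk a positive point down the boolean lattice until no single 1-bit
# can be cleared while keeping f positive; the result is a local minimum,
# found with O(n^2) membership queries in the worst case.

def find_local_minimum(f, x):
    """f: membership oracle on 0/1 tuples; x: a point with f(x) true."""
    x = list(x)
    changed = True
    while changed:
        changed = False
        for i in range(len(x)):
            if x[i] == 1:
                x[i] = 0                   # try clearing bit i
                if f(tuple(x)):            # still positive: keep the change
                    changed = True
                else:
                    x[i] = 1               # restore bit i
    return tuple(x)
```

For example, with f(x) = (x_0 AND x_1) OR x_2, starting from (1,1,1,0) the procedure descends to the local minimum (0,0,1,0).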