The Loebner Prize Competition is a restricted Turing test to evaluate the "humanness" of computer programs that interact in natural language.
This is the true story of my experience as a Loebner contestant. You can decide if it is tragedy, farce, or horror.
The first step in entering (after writing the program) is to submit an application for a kind of first cut. The implication is that only a select few programs get accepted for the contest. I suspect that any program that is even close to reasonable is accepted, but I am not sure.
After being notified that my program was accepted as a contestant, I was sent a schedule of testing dates when my modem should be turned on and I should be available by phone. These testing dates turned out to be only approximate, and I was not contacted by phone (as near as I can remember; maybe once or twice). In any event, about two weeks before the contest date, I was assured by email that all was working fine, so I concentrated on final tune-ups to the natural language program. This is a tricky business, because there is a real danger that you will get a wonderful idea at the last minute that you cannot resist trying to implement, and end up introducing a disastrous bug into your program.
The day before the contest was a disaster. I came into my office, believing that my safest strategy was to leave the program untouched until the contest was over, but found a message in my email box which said that final testing had found that my program was failing to handshake with their program, using the simple and obvious protocol specified in the official rules, and that I would be disqualified unless the problem was solved instantly.
The difficulty was that their turn-taking protocol sent two carriage returns to signal the end of the judge's turn. And, just as an aside, they would insert a line feed after every carriage return.
I assume that the logic was that this is the protocol that is normally used on human chat systems like IRC.
I discovered, after a couple of hours of frantic debugging, that the problem was not in my program, but in UNIX. The UNIX shell automatically converted every line feed to a carriage return, so my program was receiving a plethora of newline characters and was responding badly to them. I missed the obvious solution, which was to modify my program to wait for four carriage returns in a row to signal an end of turn, and instead spent a frantic day running back and forth to my UNIX administrators trying to figure out if I had to hack the shell, rewrite the termcaps file, or sacrifice a goat under a full moon to convince the system to let the line feeds remain as line feeds. The full moon was not necessary. Two dead goats later, we got the shell to stop converting the line feeds by modifying the login script to set the right environment variables inside a subshell which then spawned the program. I then had to modify my program to terminate after two line feeds instead of two carriage returns (in violation of the exact wording of the official protocol, which specified responding to carriage returns and ignoring line feeds).
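For readers who want to see the turn-taking rule concretely, here is a minimal Python sketch of an end-of-turn detector. It is an illustration only, not the contest software and not my program: the function name `split_turns` and its parameters are hypothetical, and it assumes the whole byte stream is already in hand rather than arriving over a modem.

```python
def split_turns(stream: bytes, terminator: bytes = b"\r",
                run_length: int = 2, ignored: bytes = b"\n") -> list[bytes]:
    """Split a raw byte stream into completed turns.

    A turn ends when `run_length` consecutive `terminator` bytes arrive.
    Bytes listed in `ignored` are dropped and do not interrupt a run.
    A shorter run of terminators inside a turn is kept as one line break.
    Text after the last completed turn is discarded (the turn is unfinished).
    """
    end_byte = terminator[0]
    turns = []
    current = bytearray()
    run = 0  # consecutive terminator bytes seen so far
    for value in stream:
        if value in ignored:
            continue
        if value == end_byte:
            run += 1
            if run == run_length:
                turns.append(bytes(current))
                current.clear()
                run = 0
            continue
        if run:
            # An incomplete run was an ordinary line break inside the turn.
            current.extend(b"\n")
            run = 0
        current.append(value)
    return turns


# The protocol as worded in the rules: two carriage returns, line feeds ignored.
print(split_turns(b"line one\r\nline two\r\n\r\n"))

# The workaround I missed: count four carriage returns and ignore nothing.
print(split_turns(b"hi\r\r\r\r", run_length=4, ignored=b""))
```

Making the terminator, the run length, and the ignored bytes into parameters is the point of the sketch: the same loop covers the rule as written and the variants that the line-ending conversion forced on contestants.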
I dwell on this, not only because it was so traumatic, but because I know for a fact that I was not the only contestant who had difficulties with the protocol at the last minute. The problem is that we are application-level programmers and the turn-taking protocol relies on transport-level interactions over which we have little control. A further problem is that, as near as I can tell, the preliminary testing was conducted by sending a bunch of empty carriage returns to the programs without paying attention to the results, so that problems with the turn-taking protocol were not discovered until the final testing.
I have been meaning to suggest to Dr. Epstein that an ASCII string
(maybe
As a consequence, when the day of the contest arrived, I did not have
any idea whether my program had survived the final testing or was
disqualified, so I sat in my office (thankful that I had not traveled
to California to witness the contest, but stayed in Ottawa in case
there were last minute problems) and watched my modem for three hours
to see if they were going to log in.
Log in they did. And hung up. And logged in. And hung up. Finally,
at about noon in my time zone, they phoned to tell me the schedule
for the day. I inferred that my program had not been disqualified and
settled in for an additional ten hours of nail biting over how badly my
program would fare.
I had adopted a high-risk strategy and fully expected my natural
language system to crash and burn ignominiously. For several reasons
unrelated to the contest, I had been putting all my effort into a sex
information system.
Sex is a difficult topic linguistically. It is very broad, covering
everything from how to meet a girl to statistics about herpes
infections. It is also a topic in which synonyms abound. You would be
amazed at the number of synonyms for the female breast. Many of these
synonyms are culture-, age-, and gender-specific. It is also a topic
in which oblique phrasings are de rigueur. A phrase like "do it" is
very common, both in a specific context where it refers to a specific
act, and in the general context where it may either refer generally to
sexual activity or narrowly to sexual intercourse.
Its difficulty makes it the perfect topic to exercise a new natural
language shell. It also makes it a terrible topic for a public
competition. I knew that it would perform badly. All the testing was
done over the Internet (you can telnet debra.dgbt.doc.ca 3000 and ask
about sex if you want to try it yourself). I imagine the typical user
as a young male computer scientist who has a rich sexual fantasy life,
but has never had an actual girlfriend. A typical question that the
sex information system expects to answer is something like, "How do I
find a girl who will rim me?" You don't have to be Einstein to know
that no middle-aged woman judge is going to stand in front of a
television camera and type that on a computer terminal. The judges
were from a different subculture, probably had a lot more sexual
experience, and were in a different situation than my intended user
population.
I rationalized my choice of sex as a topic by telling myself that at
least it was the most human topic that I could imagine and that the
judges might be impressed by its broad range of knowledge and
wonderfully detailed, honest, but generally politically correct,
answers. But there was no way on earth that anyone would ever mistake
my program for a live human being.
I had the additional worry that I had deliberately not told my immediate
supervisor, senior management, or anyone else in the government that I
was working on a sex information system; that I had let approximately
10,000 people call a government computer and ask blunt questions about
sex over the course of four months; and that I was now displaying this
system to the international press without their knowledge or
permission. Even the hint that a question could be raised on the floor
of the House as to why the Department of Industry was providing sex
information to the public without the knowledge, consent, or
participation of the Department of Health would have been sufficient to
shut the project down and force my withdrawal from the contest. The
Official Opposition loses no opportunity to embarrass the government
and the government never hesitates to protect itself from potential
embarrassment. Even though I managed to get as far as the contest
without being discovered and shut down, the potential political fallout
after the contest made me more than a little nervous.
So I sat and watched the transcripts scroll up my screen and waited to
see it perform disastrously. My most pessimistic predictions were
dead-on.
In my laboratory, our rule of thumb is that natural language
information systems are usable if they answer 50 per cent of questions
appropriately, but they will not be liked. If they exceed 65 per cent,
they will be well-liked, and if they exceed 75 per cent they will be
very well-liked. In the Internet testing, my sex system was exceeding
80 per cent and people were spontaneously indicating that they liked
using it. When the competition started, it performed below 20 per cent
for the first judge, and I thought briefly about unplugging the
modem and pleading unresolvable technical difficulties, or at least
insanity.
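The rule of thumb above can be written down as a small Python function. This is only a sketch: the name `rate_system` is hypothetical, and the "not usable" label for anything under 50 per cent is my reading of what the rule implies rather than something it states.

```python
def rate_system(appropriate_fraction: float) -> str:
    """Map the fraction of appropriately answered questions to a rating."""
    if appropriate_fraction > 0.75:
        return "very well-liked"
    if appropriate_fraction > 0.65:
        return "well-liked"
    if appropriate_fraction >= 0.50:
        return "usable but not liked"
    # Assumption: the rule of thumb does not name this band explicitly.
    return "not usable"


# The three figures mentioned in the text: Internet testing, the first
# judge, and the level reached by the end of the competition.
for fraction in (0.80, 0.20, 0.50):
    print(f"{fraction:.0%}: {rate_system(fraction)}")
```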
I did not take the insanity plea, but hung in there and watched its
performance come up to about 50 per cent by the end of the
competition. I suspect that performance was improving during the
course of the contest because the judges were learning how to ask
questions that were more likely to elicit meaningful answers. Bad as
overall performance was, a detailed look showed an even worse picture.
As I expected, many of the questions were on the periphery of the
topic, a clear consequence of the judges trying to avoid blunt
questions about sex in a public forum, so that fully a third of the
system's answers consisted of an appropriate "I have no information
about that." Overall only 32 per cent of the questions typed by the
judges elicited correct information from my program.
Just when it looked like things could not get any worse, my program
literally lost its mind. In essence, my program navigates around a
kind of dictionary and, due to a programming error, it was able to
navigate right off the edge. It no longer recognized any words at
all. Fortunately, after thirteen responses of "I cannot give you an
answer to that," to simple, obvious questions from three different
judges, the human referee recognized that it had gone brain dead and
rebooted the system. I did not expect to win any points with those
judges.
The contest organizers had promised to phone and let the losers know
that they had lost so that they would not have to spend the night
waiting for nothing. In my time zone, after 10:00 PM, it looked like
the contest had ended over an hour earlier, and I was already packed up
and waiting for the "Better-luck-next-time" call which would send me
home to commiserate with my family. Instead the caller told me that I
had won. My first reaction was the rather graceless thought that the
other programs must have really bombed if my program's miserable
performance was rated the highest. It did occur to me that maybe I was
the only computer contestant because all the other programs had been
disqualified for failing to respond properly to carriage returns. I
was faced with the immediate problem of trying to sound pleased and
excited in the audio press conference after spending two days and a
night in black depression.
Perversely, I was pleased when I got the final results and saw that the
other programs did not do too badly. My program was the winner by a
technical decision, not a knockout. Only three of the five judges
ranked it highest and one judge ranked it as the worst of the bunch.
There may be hope for AI yet.
Through the rose-tinted filter of hindsight, it was an adventure that I
would not have wanted to miss.
I expect to do a lot better next year.