My Experience with the 1994 Loebner Competition

by Thom Whalen


The Loebner Prize Competition is a restricted Turing test that evaluates the "humanness" of computer programs that interact in natural language.

This is the true story of my experience as a Loebner contestant. You can decide whether it is tragedy, farce, or horror.

The first step in entering (after writing the program) is to submit an application for a kind of first cut. The implication is that only a select few programs get accepted for the contest. I suspect that any program that is even close to reasonable is accepted, but I am not sure.

After being notified that my program was accepted as a contestant, I was sent a schedule of testing dates when my modem should be turned on and I should be available by phone. These testing dates turned out to be only approximate, and as near as I can remember I was contacted by phone only once or twice. In any event, about two weeks before the contest date, I was assured by email that all was working fine, so I concentrated on final tune-ups to the natural language program; a tricky business, because there is a real danger that you will get a wonderful idea at the last minute that you cannot resist trying to implement, and end up introducing a disastrous bug into your program.

The day before the contest was a disaster. I came into my office believing that my safest strategy was to leave the program untouched until the contest was over, but found a message in my email box saying that final testing had found my program failing to handshake with theirs, using the simple and obvious protocol specified in the official rules, and that I would be disqualified unless the problem was solved instantly.

The difficulty was that their turn-taking protocol sent two carriage returns to signal the end of the judge's turn. And, just as an aside, they would insert a line feed after every carriage return.

I assume that the logic was that this is the protocol that is normally used on human chat systems like IRC.

I discovered, after a couple of hours of frantic debugging, that the problem was not in my program, but in UNIX. The UNIX terminal driver was automatically converting every carriage return to a line feed, so my program was receiving a plethora of newline characters and was responding badly to them. I missed the obvious solution, which was to modify my program to wait for four newlines in a row to signal an end of turn, and instead spent a frantic day running back and forth to my UNIX administrators trying to figure out whether I had to hack the shell, rewrite the termcap file, or sacrifice a goat under a full moon to convince the system to let the carriage returns remain as carriage returns. The full moon was not necessary. Two dead goats later, we got the system to stop converting them by modifying the login script to set the right environment variables inside a subshell which then spawned the program. I then had to modify my program to terminate after two line feeds instead of two carriage returns (in violation of the exact wording of the official protocol, which specified responding to carriage returns and ignoring line feeds).

I dwell on this not only because it was so traumatic, but because I know for a fact that I was not the only contestant who had difficulties with the protocol at the last minute. The problem is that we are application-level programmers, and the turn-taking protocol relies on transport-level interactions over which we have little control. And, as near as I can tell, the preliminary testing was conducted by sending a bunch of empty carriage returns to the programs without paying attention to the results, so problems with the turn-taking protocol were not discovered until the final testing.

I have been meaning to suggest to Dr. Epstein that an ASCII string (perhaps an SGML-style tag) should be used to signal the end of turn rather than carriage returns. Such a string would be much easier for application programs to handle, particularly those written in fourth-generation languages.

As a consequence, when the day of the contest arrived, I did not have any idea whether my program had survived the final testing or been disqualified, so I sat in my office (thankful that I had not traveled to California to witness the contest, but had stayed in Ottawa in case there were last-minute problems) and watched my modem for three hours to see if they were going to log in.

Log in they did. And hung up. And logged in. And hung up. Finally, at about noon in my time zone, they phoned to tell me the schedule for the day. I inferred that my program had not been disqualified and settled in for an additional ten hours of nail biting over how badly my program would fare.

I had adopted a high-risk strategy and fully expected my natural language system to crash and burn ignominiously. For several reasons unrelated to the contest, I had been putting all my effort into a sex information system.

Sex is a difficult topic linguistically. It is very broad, covering everything from how to meet a girl to statistics about herpes infections. It is also a topic in which synonyms abound. You would be amazed at the number of synonyms for the female breast. Many of these synonyms are culture-, age-, and gender-specific. It is also a topic in which oblique phrasings are de rigueur. A phrase like "do it" is very common, both in a specific context, where it refers to a specific act, and in the general context, where it may refer either generally to sexual activity or narrowly to sexual intercourse.

Its difficulty makes it the perfect topic to exercise a new natural language shell. It also makes it a terrible topic for a public competition. I knew that it would perform badly. All the testing was done over the Internet (you can telnet debra.dgbt.doc.ca 3000 and ask about sex if you want to try it yourself). I imagine the typical user as a young male computer scientist who has a rich sexual fantasy life, but has never had an actual girlfriend. A typical question that the sex information system expects to answer is something like, "How do I find a girl who will rim me?" You don't have to be Einstein to know that no middle-aged woman judge is going to stand in front of a television camera and type that on a computer terminal. The judges were from a different subculture, probably had a lot more sexual experience, and were in a different situation than my intended user population.

I rationalized my choice of sex as a topic by telling myself that at least it was the most human topic that I could imagine and that the judges might be impressed by its broad range of knowledge and wonderfully detailed, honest, but generally politically correct, answers. But there was no way on earth that anyone would ever mistake my program for a live human being.

I had the additional worry that I had deliberately not told my immediate supervisor, senior management, or anyone else in the government that I was working on a sex information system; that I had let approximately 10,000 people call a government computer and ask blunt questions about sex over the course of four months; and that I was now displaying this system to the international press without their knowledge or permission. Even the hint that a question could be raised on the floor of the House as to why the Department of Industry was providing sex information to the public without the knowledge, consent, or participation of the Department of Health would have been sufficient to shut the project down and force my withdrawal from the contest. The Official Opposition loses no opportunity to embarrass the government, and the government never hesitates to protect itself from potential embarrassment. Even though I managed to get as far as the contest without being discovered and shut down, the potential political fallout after the contest made me more than a little nervous.

So I sat and watched the transcripts scroll up my screen and waited to see it perform disastrously. My most pessimistic predictions were dead-on.

In my laboratory, our rule of thumb is that natural language information systems are usable if they answer 50 per cent of questions appropriately, but they will not be liked. If they exceed 65 per cent, they will be well-liked, and if they exceed 75 per cent they will be very well-liked. In the Internet testing, my sex system was exceeding 80 per cent and people were spontaneously indicating that they liked using it. When the competition started, it performed below 20 per cent for the first judge, and I thought briefly about unplugging the modem and pleading unresolvable technical difficulties, or at least insanity.

I did not take the insanity plea, but hung in there and watched its performance come up to about 50 per cent by the end of the competition. I suspect that performance was improving during the course of the contest because the judges were learning how to ask questions that were more likely to elicit meaningful answers. Bad as overall performance was, a detailed look showed an even worse picture. As I expected, many of the questions were on the periphery of the topic, a clear consequence of the judges trying to avoid blunt questions about sex in a public forum, so that fully a third of the system's answers consisted of an appropriate "I have no information about that." Overall, only 32 per cent of the questions typed by the judges elicited correct information from my program.

Just when it looked like things could not get any worse, my program literally lost its mind. In essence, my program navigates around a kind of dictionary and, due to a programming error, it was able to navigate right off the edge. It no longer recognized any words at all. Fortunately, after thirteen responses of "I cannot give you an answer to that" to simple, obvious questions from three different judges, the human referee recognized that it had gone brain dead and rebooted the system. I did not expect to win any points with those judges.
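The failure mode is easy to reproduce in miniature. The toy below is an invented illustration, not the actual contest program: a cursor that walks an index into a dictionary-like structure without a bounds check ends up in a position where every lookup fails, while a clamped move cannot leave the valid range.

```python
# Toy stand-in for the program's dictionary; entries and functions are invented.
entries = ["breast", "herpes", "intercourse", "synonym"]

def move_unclamped(pos, delta):
    # navigates freely, so it can walk right off either edge of the list
    return pos + delta

def move_clamped(pos, delta):
    # a guarded move that refuses to leave the valid index range
    return max(0, min(len(entries) - 1, pos + delta))

def lookup(pos):
    # None plays the role of "I cannot give you an answer to that."
    return entries[pos] if 0 <= pos < len(entries) else None
```

Once the unclamped cursor is out of range, every subsequent lookup returns the same non-answer until something (here, re-initializing the position; at the contest, a reboot) puts it back in bounds.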

The contest organizers had promised to phone and let the losers know that they had lost so that they would not have to spend the night waiting for nothing. By 10:00 PM in my time zone, it looked like the contest had ended over an hour earlier, and I was already packed up and waiting for the "Better-luck-next-time" call which would send me home to commiserate with my family. Instead, the caller told me that I had won. My first reaction was the rather graceless thought that the other programs must have really bombed if my program's miserable performance was rated the highest. It did occur to me that maybe I was the only computer contestant because all the other programs had been disqualified for failing to respond properly to carriage returns. I was faced with the immediate problem of trying to sound pleased and excited in the audio press conference after spending two days and a night in black depression.

Perversely, I was pleased when I got the final results and saw that the other programs did not do too badly. My program was the winner by a technical decision, not a knockout. Only three of the five judges ranked it highest and one judge ranked it as the worst of the bunch. There may be hope for AI yet.

Through the rose-tinted filter of hindsight, it was an adventure that I would not have wanted to miss.

I expect to do a lot better next year.