Thom's 1995 Loebner Competition Experience

Thom's Participation in the Loebner Competition 1995
or
How I Lost the Contest and Re-Evaluated Humanity

by Thom Whalen

The Loebner Prize Competition is a restricted Turing test to evaluate the "humanness" of computer programs which interact in natural language.

In 1994, I won the Loebner Prize Competition in San Diego, California by a hair. In fact, considering how poorly my program performed, I am still surprised that I won at all.

In 1995 the competition was to be held in New York on 16 December.

I felt that I could not help but do better.

The 1995 rules were changed from previous competitions. For the first time, the judges would be permitted to ask any questions that they wanted, rather than being restricted to a particular topic for each program. This was intended to make the competition more difficult.

In order to accommodate the "no-topic" rule, I decided that the best approach would be to try to model a human being. I would not simply try to answer questions, but would try to incorporate a personality, a personal history, and a unique view of the world. In short, I had to invent a person.

This may sound daunting until you realize that people have been constructing models of human beings for centuries; every novel or play is populated with invented people. For example, Sir Arthur Conan Doyle created a complete personality, personal history, and unique world view for Sherlock Holmes which was so compelling that many people believe he was an historical figure.

The only difference was that I would have to make a character which would respond to a variety of inputs. I had done this before in a simulation of a conversation with a university undergraduate (On the Internet, "telnet debra.dgbt.doc.ca 3000" and ask to talk to Alice). This time, with more experience and newer, more powerful software, I could surely make an even better simulation.

To limit the conversation, I decided to create a character who had a fairly narrow world view; who was only marginally literate and, therefore, did not read books or newspapers; and who worked nights and, therefore, was unable to watch prime-time television. Furthermore, to give the conversation some direction and to capture the judge's attention, I created a minor mystery plot. He would be a janitor who was about to lose his job. By conversing with him, you could find out that he was actually the victim of a deliberate slander and learn enough to tell him how to keep his job.

I spent three months writing the conversation and testing it on the Internet ("telnet debra.dgbt.doc.ca 3000" and ask to speak to Joe).

As the deadline approached, I had second thoughts about entering the 1995 competition at all. Unlike the previous four competitions, in 1995, programs could not participate through any kind of communications medium. They would be required to run on site at the Salmagundi Art Club in New York City.

My program had been developed on a Sun SPARC workstation, and would not run on a PC. I did not relish the thought of trying to carry a SPARC to New York. They do not work well when they are not connected to a network, they do not fit under an airplane seat, and I did not want to risk having my primary development platform stuck in some customs broker's office for weeks while I missed the competition.

Hugh Loebner agreed that I could enter the competition contingent upon him supplying a computer for me to use. And he did. Sun computers agreed to lend a SPARC workstation to the competition.

My program, Joe the Janitor, was in. I was committed.

As the date for the contest approached, I devoted a couple of weeks to implementing the mundane technical details that would be required for the competition.

I decided that the easiest way to configure my entry was to have the SPARC communicate with a PC via the serial port. That way the appearance of the screen would be identical to that of the human confederates' screens. An added bonus was that the PC would take responsibility for collecting the transcripts in the required format.

All I had to do was make my program communicate through the serial port on a standalone SPARC using the communications protocol specified for the contest.

Yeah, right.

To get control of the serial port, I pored over the UNIX technical manuals to learn all about "ioctl()" and "termio.h" and "non-canonical mode" and other mysterious UNIX incantations.
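For readers who have not met these incantations, here is a minimal sketch of what non-canonical mode involves. It is not the code I ran on the SPARC: it uses Python's termios module, which wraps the POSIX successor to the older "termio.h" interface, and the function names, the device path in the comment, and the VMIN/VTIME defaults are all assumptions chosen for illustration. The contest's baud rate and handshake are deliberately left out.

```python
import os
import termios


def noncanonical_attrs(attrs, vmin=1, vtime=0):
    """Return a copy of a termios attribute list with canonical mode off.

    attrs has the layout returned by termios.tcgetattr():
    [iflag, oflag, cflag, lflag, ispeed, ospeed, cc].
    """
    iflag, oflag, cflag, lflag, ispeed, ospeed, cc = attrs
    cc = list(cc)  # copy so the caller's list is left untouched
    # Clear ICANON (line-at-a-time input), ECHO (local echo) and ISIG
    # (interrupt keys) so that every byte is delivered as it arrives.
    lflag &= ~(termios.ICANON | termios.ECHO | termios.ISIG)
    # In non-canonical mode, VMIN and VTIME decide when read() returns.
    cc[termios.VMIN] = vmin
    cc[termios.VTIME] = vtime
    return [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]


def open_serial(path):
    """Open a serial device and switch it to non-canonical mode.

    The path is hypothetical, for example "/dev/ttya".
    """
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    attrs = noncanonical_attrs(termios.tcgetattr(fd))
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    return fd
```

With vmin=1 and vtime=0, a read on the port blocks until at least one byte arrives and then returns immediately, which is what a character-by-character conversation needs.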

I also poked and prodded Loebner's communications program to learn all about double carriage returns and "CCC99" handshakes and other arcane rites.

Next, I had to learn all about how Sun Workstations are administered in stand-alone mode. Sun's motto is "The network is the computer." My worst nightmare was traveling all the way to New York and then finding that I could not get the SPARC running properly. I thought that Sun would probably deliver a computer that worked in standalone mode, but I could not risk being caught unawares if their machine expected to find a network plugged into the ethernet port. So I learned about more obscure UNIX incantations called "boot -s" and "localhost 127.0.0.1" and "hostname.xx0".

Finally, I had to introduce realistic keystroke delays, typing errors, and thoughtful pauses into the output of my program. Unlike the previous year, the judges would be seeing the output of the program displayed character by character. The program not only had to appear to understand English, it had to look like there was a human being typing the answers.
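To make the idea concrete, here is a minimal sketch of such a typing simulation. It is an illustration under stated assumptions, not Joe's actual code: the function names, the delay values, the typo rate, and the rule that a wrong letter is followed by a backspace are all hypothetical.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def typing_events(text, rng, typo_rate=0.03, base_delay=0.12, pause=0.8):
    """Yield (delay_in_seconds, character) pairs that imitate a human typist."""

    def jitter():
        # Keystroke intervals vary around base_delay instead of being constant.
        return rng.uniform(0.5, 1.5) * base_delay

    extra = pause  # a thoughtful pause before the first keystroke
    for ch in text:
        if ch.isalpha() and rng.random() < typo_rate:
            # Hit a wrong key, notice it, and back up over it.
            yield extra + jitter(), rng.choice(LETTERS)
            yield 2 * jitter(), "\b"
            extra = 0.0
        yield extra + jitter(), ch
        # Pause again after finishing a sentence.
        extra = pause if ch in ".?!" else 0.0


def replay(events):
    """Return the text left on screen once the backspaces are applied."""
    shown = []
    for _, ch in events:
        if ch == "\b":
            if shown:
                shown.pop()
        else:
            shown.append(ch)
    return "".join(shown)
```

A driver would sleep for each delay and then write the character to the serial port. Passing a seeded random.Random as rng makes a run repeatable, which helps when tuning the delays.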

Armed with my program disk, a sheet of instructions for configuring UNIX in standalone mode, another sheet of instructions for communicating with Loebner's program, and my own cables and manuals -- just in case -- I drove to the Ottawa airport on Thursday afternoon.

I toted my suitcase across the airport parking lot in -20 C, a wind blowing steadily at 30 km/hr, and 5 cm/hr snow accumulation -- in technical terms, a Canadian mid-winter blizzard -- wondering if the airplane would be able to take off at all. But my fears were for nought. Air Canada was not about to be deterred by a little adverse weather.

In New York, I found that their balmy +3 C with no precipitation was too warm for my eiderdown parka with the fur-rimmed snorkel hood.

Something else to make me sweat through the next two days.

Friday morning I found the Salmagundi Club with the help of my cab driver ("What? Fifth Avenue? Where on Fifth Avenue? You don't know the cross street? Are you sure you don't know the cross street? How about guessing. Maybe 47th street? Does that sound right? No? Well pick a cross street!") and found Hugh Loebner waiting for me. Let me make this perfectly clear. I found that only Hugh Loebner was waiting for me. ("Staff? Help? No, there's no one else. I'm running this contest myself. Like Blanche in 'A Streetcar Named Desire,' I'm relying on the kindness of strangers.")

He favored western-style string ties. ("Howdy, Stranger!").

There was a room stacked high with some thirty-odd crates. Sure enough, two of these crates held a SPARCstation and monitor and the rest held IBM PCs and monitors. None of these crates held null modem cables. None of these crates held power bars. And none of these crates held the video multiplexers necessary to show the contest to the audience.

Fortunately, I brought my own null modem and cables. Hugh went out and bought a couple of power bars.

The SPARCstation that Sun delivered was perfectly configured. There was no need for "boot -s" or "localhost 127.0.0.1" or even "mv hostname.le0 hostname.xx0". Twenty minutes later Joe the Janitor and I were on speaking terms.

Hugh Loebner got FRED, Robby Garner's program, running and announced that we had a contest. Even if no other contestants or confederates showed up, we still had two computer programs that could compete against each other.

I was not going to win by default.

I spent the rest of the day fiddling with Joe. Tweaking this and twitching that; uncertain whether I was improving his performance or introducing more bugs. But I was too nervous to leave him alone.

On Saturday Joe the Janitor would face Joseph Weintraub's program, the PC-Therapist, which had won the first three Loebner Competitions. Though Joseph and I had both won Loebner medals in previous years, we had never competed in the same competition.

The courier did not arrive with the promised cables. Hugh went out and bought more power bars. He had some new null modem cables custom made.

That evening, two other competitors, Philip Maymin and Joseph Weintraub, arrived at the club. We ate a Christmas dinner, played some pool (Hugh preferred a game called "cowboy" ("Howdy, Stranger!"). I won our game. At least I can claim that I won something in New York), and made sure that the other programs worked. Now we had a four-way contest. As well, the courier finally delivered the video cables and null modems, so we would have a contest that an audience could see.

The next morning, bright and early (8:15 AM), we started setting up the room for the competition. Unfortunately, the competition was held in the same room as the Christmas dinner, so there was no way to set up the computers before the day of the contest. Hugh ("I depend on the kindness of strangers"), Philip, Jose the superintendent, and I rolled up our sleeves and began uncrating the IBM PCs and carrying them up the stairs. I dearly wish elevators had been invented a hundred and fifty years ago when the Salmagundi Club was founded.

To be honest, I rather enjoyed helping set up the computers. It gave me something more productive to do during those 24 hours than to sit and stew about how Joe would perform.

In three hours Philip, Hugh, and I managed to set up a dozen PCs, one SPARC, and twenty monitors, install the communications software everywhere, yoke the right machines together with the null modems, and install curtains to separate the judges from the confederates and the audience.

The confederates were led to their terminals and instructed. The judges were introduced. The competition began. Judges typed questions for fifteen minutes on each terminal and programs and confederates responded. The audience watched. Philip and I watched. Joseph Weintraub spent most of his time in the club lounge, cool and confident.

In the second round, the judges were given an additional five minutes to query any terminal that they were uncertain about. None of the judges bothered trying Joe a second time. I knew that was a bad sign.

Finally, the judges were asked to rank-order each terminal from most to least human.

The results were tallied and Hugh announced the winner: "Joseph Weintraub."

I lost.

Actually I came in second, but losing to Joseph Weintraub was still losing.

Robby Garner from Robitron came in third. He was at a clear disadvantage because his program, FRED, ran from DOS so his screen looked different from the other seven screens.

Philip Maymin's strategy was to minimize the judges' opportunity to interact with his program. It produced very long output at a painfully slow typing speed. Many judges only had an opportunity to ask a single question and we only saw about three different answers during the whole contest. Cute idea. The judges were not impressed.

After the competition, we talked to the judges. They were mostly from the media and unanimously agreed that they enjoyed being judges. They were in no hurry to leave and the journalists among them took the time to interview everyone in sight.

I was disappointed that Weintraub won again, but the rules were clear and he won fairly.

There are lessons for me to learn. Several of my hypotheses were disproved. Or at least cast into strong doubt.

First, I had hypothesized that the number of topics that would arise in an open conversation would be limited. If you look at Dale Carnegie, an expert in making small talk to strangers, he states that strangers talk about (in this approximate order):

their names
where they live
where they used to live
people that they know in common
the weather
sports
politics
books, television, movies and music
hobbies

I believe that he is correct and I programmed Joe to have some response for common questions on each of these topics.

My error was that the judges, under Loebner's rules, did not treat the competitors as though they were strangers. Rather, they specifically tested the program with unusual questions like, "What did you have for dinner last night?" or "What was Lincoln's first name?" These are questions that no one would ever ask a stranger in the first fifteen minutes of a conversation.

Robby Garner's program, FRED, encountered the same problem for about the same reason. It was prepared to answer questions about various aspects of his personal life, but the judges never asked any questions which produced those answers.

Second, I hypothesized that, once the judge knew that he was talking to a computer, he would let the computer suggest a topic. I do not believe that any existing computer program can seriously pretend to be a human being for more than a half-dozen interactions, so I consider the human confederates to be a red herring. I believe that the real issue is whether my program appears more human than the other programs.

Thus, my program tried to interest the judges in Joe's employment problems as soon as possible. This would lead most quickly to the richest interactions because this was the part of the program that had been most highly developed.

I was surprised to see how persistent some judges were in refusing to ever discuss Joe's job. It seemed that the judges would rather see the program reply, "I don't know," twenty times in a row to various strange questions than to get reasonable responses to questions about why he is worried about losing his job. I guess they really wanted to hammer home the point that Joe is not a human being.

Third, I hypothesized that the judges would be more tolerant of the program saying, "I don't know," than of a non sequitur. Thus, rather than having the program make a bunch of irrelevant statements when it could not understand questions, I simply had it rotate through four statements that were synonymous with "I don't know."
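The rotation itself is simple, and a minimal sketch shows the shape of it. This is an illustration, not the TIPS code: the four fallback sentences, the class name, and the plain keyword matching that stands in for the real question matching are all assumptions.

```python
from itertools import cycle

# Hypothetical wording; Joe's actual four statements are not reproduced here.
FALLBACKS = [
    "I don't know.",
    "I have no idea.",
    "Beats me.",
    "I couldn't tell you.",
]


class Responder:
    """Answer known questions; otherwise rotate through the fallbacks."""

    def __init__(self, known):
        # known maps a keyword to a canned answer (a crude stand-in for
        # whatever matching the real program performs).
        self.known = known
        self._fallbacks = cycle(FALLBACKS)

    def reply(self, question):
        lowered = question.lower()
        for keyword, answer in self.known.items():
            if keyword in lowered:
                return answer
        # Nothing matched: admit ignorance, varying the wording each time.
        return next(self._fallbacks)
```

A judge who asks five unfamiliar questions in a row sees all four statements and then the first one again, which is exactly the repetition the judges were happy to provoke.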

Weintraub's program, however, was a master of the non sequitur. It would continually reply with some wildly irrelevant statement, but throw in a qualifying clause or sentence that used a noun or verb phrase from the judge's question in order to try to establish a thin veneer of relevance.
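I have not seen the inside of the PC-Therapist, so the following is only a sketch of the general trick as it looked from the audience, not his method. Everything in it is an assumption: the canned remarks, the stopword list, and the crude rule of borrowing the longest non-stopword word in place of a real noun or verb phrase.

```python
import re

# Hypothetical stopword list, just large enough for the illustration.
STOPWORDS = {
    "a", "an", "the", "what", "who", "why", "how", "did", "do", "does",
    "you", "your", "have", "for", "is", "was", "are", "about", "me", "my",
}

# Hypothetical canned remarks, unrelated to anything the judge might ask.
REMARKS = [
    "My brother never could keep a secret.",
    "People worry too much these days.",
    "It all comes down to money in the end.",
]


def borrowed_word(question):
    """Pick the longest word that is not a stopword, or None if there is none."""
    words = re.findall(r"[A-Za-z']+", question)
    candidates = [w for w in words if w.lower() not in STOPWORDS]
    return max(candidates, key=len) if candidates else None


def non_sequitur(question, turn):
    """Return an irrelevant remark dressed up with a word from the question."""
    remark = REMARKS[turn % len(REMARKS)]
    word = borrowed_word(question)
    if word is None:
        return remark
    return f"Funny you should mention {word}. {remark}"
```

Asked what it had for dinner last night, this sketch borrows the word "dinner" and then says something that has nothing to do with dinner.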

I am amazed at how cheerfully the judges tolerated that kind of behaviour. I can only conclude that people do not require that their conversational partners be consistent or even reasonable.

But I am not ready to draw any conclusion about whether this is a fundamental problem with the Turing test. Remember that we are talking about conversational partners that are fairly quickly recognizable as computer programs. To appear completely human, I would expect (hope) that the program would have to be much more responsive to the questions that were asked.

Fourth, I hypothesized that a critical component of "humanness" was personality. I felt that it was important that my program have a consistent and identifiable personality.

I think I was successful in this. In discussion with the judges after the contest, when I confessed to being Joe's creator, one of the judges said that he thought Joe had the best defined personality. I then asked if he rated Joe as the most human computer and he said, "No." I probed and prodded for a few minutes, but he could not explain why he thought that humanness was different from having a human personality.

I am still puzzled about that.

The failure of all four of my hypotheses leaves me in a quandary.

I believe that I could modify Joe to beat Weintraub simply by replacing the "I don't know" part of the program with a little Weintraub/ELIZA style routine. I estimate that it would take about two weeks of effort to produce a routine that would be adequate for my purposes, though much less sophisticated than Weintraub's. Then my program would still answer all of the questions that it already does, but, when it encountered an unfamiliar question, it would never say "What?" or "I don't know." Rather, it would just introduce a new topic. Not as smoothly as Weintraub's program, but smoothly enough.

But I don't know if I want to do that. Making that modification would have no purpose whatsoever, apart from winning the next competition. The primary goal of my TIPS software, like Robby's FRED, is to create useful information systems such as computer help systems, not to win Loebner's competition. I do not believe that Weintraub's approach, which follows in the footsteps of Joseph Weizenbaum's ELIZA, will ever lead to a useful way to deliver factual information.

Lying awake in the middle of the night in my hotel room after losing the competition, I wrote out a list of eight major enhancements to TIPS which would make it a more powerful information delivery system. I would much rather spend my time implementing these enhancements than re-implementing an enhanced ELIZA that would not be good for anything.

As well, I am philosophically enamoured with the idea of writing a program which models a real human being rather than a program which simply tries to field random questions. And I am philosophically opposed to writing a program that performs syntactic tricks without any interesting semantics (or more specifically, semantic-based pragmatics). I would rather keep trying to build a model of a human being that will do better than Weintraub's program than beat him at his own game. If I have to resort to ELIZA's tricks, then I will be admitting a fundamental flaw in my own approach.

The next contest will be held in April '96. I enjoyed entering the Loebner competition immensely in the last two years. I would encourage everyone working in natural language to think about entering the next one. If for no other reason than to show that ELIZA-style programs are not the epitome of natural-language processing.

For myself, I do not know what the future will bring. Except that I do know that I will keep developing my software. And I will start thinking seriously about what it means to judge a conversational partner as "human."
