the crowdsourced corpus. An additional 100 stories were uniformly sampled from all possible stories that could be generated by the plot graph; each event was described by the most frequent sentence in its underlying natural-language cluster. 300 participants edited the generated stories such that each story was seen by three judges.
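As a rough illustration of what such uniform sampling could look like, the sketch below enumerates every event sequence that satisfies a toy plot graph's precedence and mutual-exclusion constraints and picks one at random. The event names, constraints, and enumeration strategy are invented for illustration; they are not taken from the crowdsourced corpus, and the actual system's generation procedure may differ.

```python
import itertools
import random

# Toy plot graph for illustration only (not from the crowdsourced corpus).
# PRECEDENCE holds edges (a, b) meaning "a must occur before b";
# MUTEX holds pairs of events that may not appear in the same story.
PRECEDENCE = {
    ("enter", "wait"),
    ("wait", "order"),
    ("order", "pay"),
    ("pay", "leave"),
}
MUTEX = {frozenset({"pay_cash", "pay_card"})}
EVENTS = {"enter", "wait", "order", "pay", "pay_cash", "pay_card", "leave"}


def is_legal(sequence):
    """A sequence is legal if it respects every precedence edge whose
    endpoints both appear and contains no mutually exclusive pair."""
    position = {event: i for i, event in enumerate(sequence)}
    for before, after in PRECEDENCE:
        if before in position and after in position and position[before] > position[after]:
            return False
    return not any(pair <= set(sequence) for pair in MUTEX)


def all_stories(events):
    """Enumerate every legal ordering of every subset of events.
    Exponential, so only feasible for toy graphs; the real plot graph
    also distinguishes optional from mandatory events."""
    stories = []
    for size in range(1, len(events) + 1):
        for subset in itertools.combinations(sorted(events), size):
            stories.extend(seq for seq in itertools.permutations(subset) if is_legal(seq))
    return stories


# Uniform sampling over the story space: pick one legal sequence at random.
print(random.choice(all_stories(EVENTS)))
```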
We take the number of edits as an indication of story quality; fewer edits indicate better quality. The numbers of events added, deleted, and moved are shown in Table 1.
Welch's t-tests were used to determine whether the differences between human-authored and computer-generated stories are statistically significant at p < 0.05. Plot graphs merge elements from many stories, so computer-generated stories are on average nearly twice as long as human-authored stories. We therefore also report mean additions, deletions, and reorderings after normalizing for story length.
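A brief sketch of this analysis, assuming per-story edit counts are available as lists; the numbers below are placeholders rather than the actual evaluation data, and scipy's ttest_ind with equal_var=False is one standard way to run Welch's t-test.

```python
from scipy import stats

# Placeholder per-story counts; the real evaluation used the edit counts
# collected from the judges, not these made-up values.
human_deletes = [0, 1, 0, 0, 2, 1, 0, 0, 1, 0]
computer_deletes = [1, 0, 2, 1, 1, 0, 3, 0, 1, 2]
human_lengths = [12, 13, 11, 14, 12, 13, 12, 14, 13, 12]
computer_lengths = [22, 24, 23, 21, 25, 22, 24, 23, 22, 25]

# Welch's t-test (unequal variances) on the raw deletion counts.
t, p = stats.ttest_ind(human_deletes, computer_deletes, equal_var=False)
print(f"raw deletions: t = {t:.3f}, p = {p:.3f}, significant = {p < 0.05}")

# Normalize by story length before re-testing, so longer computer-generated
# stories are not penalized for their length alone.
human_norm = [d / n for d, n in zip(human_deletes, human_lengths)]
computer_norm = [d / n for d, n in zip(computer_deletes, computer_lengths)]
t, p = stats.ttest_ind(human_norm, computer_norm, equal_var=False)
print(f"normalized deletions: t = {t:.3f}, p = {p:.3f}, significant = {p < 0.05}")
```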
We find the differences between human-authored and computer-generated stories to be insignificant on many measures, which is quite suggestive given the scale of the evaluation. No significant difference exists in the number of added events. We conclude that SCHEHERAZADE does not omit essential events any more often than human authors do.
We find a small but statistically significant difference in the number of events deleted from computer-generated stories. Although statistically significant, the mean difference between conditions is less than one deleted event. The significance vanishes when the two events “wait in line” and “open door” are withheld from the analysis. These two events account for 64.5% of deletions and occur rarely in the corpus (6 and 9 times out of 60), which explains why the system is less certain about their inclusion. We conclude that, despite the existence of multiple incompatible alternatives throughout the plot graph, the system does not add events that contradict human intuition.
We attribute this result to the use of mutual exclusion relations, which successfully separate incompatible events. The system does, however, tell “verbose” stories containing events that are correct but unnecessary for human comprehension; this should be addressed by future work.
The reordering of events is a consequence of under- or over-constrained temporal precedence relations. The fact that these events are not deleted indicates that they contribute to the story but are not ideally positioned. Moves may be due to preferences for event ordering or to errors in the plot graph. 32.3% of moves are consistent with the plot graph; in these cases the reordering corresponds to another legal story, and the judge may simply be expressing a preference for a different story that the system could have generated.
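One way to make this consistency check concrete, reusing the precedence-edge representation sketched earlier: a move is consistent with the plot graph exactly when the edited ordering violates no temporal constraint. The helper and the example events below are hypothetical and only illustrate the idea.

```python
def violated_constraints(sequence, precedence):
    """Return the precedence edges (before, after) that the edited sequence
    violates; an empty list means the reordering is still a legal story."""
    position = {event: i for i, event in enumerate(sequence)}
    return [
        (before, after)
        for before, after in precedence
        if before in position and after in position and position[before] > position[after]
    ]


# Hypothetical example: a judge moves "press alarm" to the front of the story.
precedence = {("enter bank", "press alarm"), ("press alarm", "police arrive")}
edited_story = ["press alarm", "enter bank", "police arrive"]
print(violated_constraints(edited_story, precedence))
# [('enter bank', 'press alarm')] -> this move is inconsistent with the graph
```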
The rest of the moves violate temporal constraints in the plot graph, which may indicate a preference for an ordering that the graph over-constrains against rather than a severe error. For example, pressing the alarm is over-constrained to occur in the second half of the story. Two moved events account for a plurality of the inconsistencies: “get in car” is uniformly moved from the end to the beginning, and “Sally cries” is a rare event in the corpus. Excluding these two events reduces the proportion of generated stories with inconsistent moves to 44%. Over-commitment to temporal relations does not necessarily imply that the plot graph, or the stories subsequently generated from it, are nonsensical or incoherent.
Conclusions
In this paper, we present an open story generation system, SCHEHERAZADE, which tackles the challenge of creating stories about any topic that humans can agree on how to describe. The system achieves this by first learning a model of the given topic, a plot graph, from crowdsourced examples. We leverage emerging techniques for learning from stories and extend them by recognizing when events are mutually exclusive, thus ensuring that the system cannot generate stories that conflate distinct variations. SCHEHERAZADE generates stories by stochastically sampling from the space of stories defined by the plot graph, finding a sequence of events that does not violate any constraints. A large-scale evaluation confirms that generated stories are of comparable quality to narratives written by untrained humans.
To date, the reliance of story generation systems on a priori known domain models has limited the degree to which these systems can be said to possess narrative intelligence. Open story generation overcomes the knowledge engineering bottleneck by demonstrating that a system can learn to tell stories by using a crowd of anonymous workers as a surrogate for real-world experiences. We believe that this is a crucial step toward achieving human-level computational narrative intelligence.
Acknowledgements
We gratefully acknowledge the support of the U.S.
Defense Advanced Research Projects Agency (DARPA).
Special thanks to Alexander Zook.
Table 1. Experiment results.

                                          Human   Computer
Mean original length                      12.78     23.14
Mean final length                         11.82     21.89
Mean events added                          0.33      0.49 §
Mean events added (normalized)             0.03      0.02 §
Mean events deleted                        0.30      0.76 *
Mean events deleted (normalized)           0.02      0.03 §
Mean events deleted (2 events withheld)    0.28      0.27 §
Mean events deleted (2 withheld, norm.)    0.02      0.01 §
Mean events moved                          0.53      3.57 *
Mean events moved (normalized)             0.04      0.15 *

* The difference is statistically significant (p < 0.05)
§ The difference is not statistically significant (p ≥ 0.05)