Progress Update October 2019

This is an update on our progress towards our goals over the last ten months. If you can only read 650 characters of this update, like the judges in our experiments, here’s what you need to know:

  1. We switched from experiments that break down tasks (factored generation) to experiments that break down evaluating expert work (factored evaluation)
  2. 60+ participants have been collectively working 150+ hours per week on our experiments
  3. We're building Mosaic2, an app that streamlines running varied question-answer experiments (factored evaluation, debate, etc.)
  4. We're exploring whether language models can automate decompositions, currently getting 30% accuracy on the Complex Web Questions dataset
  5. William Saunders joined as ML engineer, Jungwon Byun as COO
  6. We’re hiring an engineering team lead and a business operations person. We’ll pay $5,000 for a successful referral!

What we do and why

We believe that the most important questions facing humanity are complex and open-ended. These questions range from “What types of policies will effectively curb climate change?” to “How should we deal with the potentially transformative impacts of AI?” and “What career should I pursue?” Despite their importance, such open-ended questions are answered poorly today.

As AI plays a bigger role in society, the world will likely get more complex. It will become even more important to give good answers to questions like these. Unfortunately, AI is not on track to help substantially with answering these open-ended questions. So far, we only know how to use AI to help us with tasks that have clear metrics or fast empirical feedback loops.

Our mission is to make AI just as useful for open-ended questions. Figuring out how to direct the most powerful technologies of our time to the most important questions society wrestles with is a highly leveraged way to have a large, positive impact. Rather than directly tackling climate change, or poverty, or animal suffering, we’re improving the process by which decisions on all of these issues get made.

To apply AI to questions like these, we design, test, and implement mechanisms for delegating open-ended cognitive work to experts who are only trying to optimize clear feedback signals. Our work today involves running experiments with human participants, building web apps to gather data from and structure the experiments, and connecting what we learn from human experiments to ML training. Over time, we'll incrementally automate the work of our human participants and build a platform that deploys ML to answer open-ended questions.

Following up on our December 2018 update

We ended our last update with the following goals for the first half of 2019:

  1. Run more multi-user experiments. Get to a point where we can gather substantial evidence on the feasibility of factored cognition.
  2. Continue our foundations research program, integrating reflection, laziness, distillation, speculative execution, and scheduling via question-answering into a single prototype.
  3. Over time, consolidate Mosaic, Affable, and potentially Relay into a single app.
  4. Fill our open roles: COO, web developer, and experimenter.

We’ve done 1 and 3, parts of 2, and most of 4. We’re still hiring for an engineering team lead!

Experiments with human participants

From breaking down tasks to breaking down evaluation

As of our last update, we were running factored generation experiments. In these experiments, participants break down a complex task into easier tasks, delegate the easier tasks, and use the solutions to these tasks to complete the larger task.

For example, a participant in a factored generation experiment might get the question “What are all of the nouns in the sentence below?” and they would have to return a list of nouns.

Since March of this year, we've switched to running factored evaluation experiments, another instance of factored cognition. Instead of breaking down the original task to complete it, we instead break down the evaluation of solutions to the task.

The factored evaluation version of the question above looks like this: “Is ‘dog’ or ‘cat’ a better answer to the question ‘What are all of the nouns in the sentence below?’”, and the participant chooses one of the two answers.

There are a few reasons why we concluded that factored evaluation is a better research direction for now:

  1. If there are experts in the world that can already generate good answers to questions, we should just use those capabilities as is, without reproducing them using decompositions. In particular, we'd like to use ML systems that are trained end-to-end as long as we can evaluate the quality of their outputs.
  2. In many cases, evaluating solutions is easier than completing the task. It's easier to check that a solution to a Sudoku is indeed a solution than to come up with it in the first place.
  3. After running a few experiments, we realized that breaking down tasks of any interesting complexity using factored generation requires extremely large trees of work. We could get around this by summarizing entire subtrees using experts ("oracles") and focusing on the most difficult subtrees. If we then choose which subtrees to expand using adversaries who win if they can find a case where decomposition doesn't work, we're already close to the factored evaluation setting.
  4. For training ML systems, factored generation on its own is an incomplete solution if we’re working with agents that can't exactly imitate the demonstrated behavior. Factored evaluation defines a reward function that specifies what better/worse behavior looks like. Factored generation only demonstrates intended behavior. To define better/worse behavior for generation, we still need a distance metric that can compare any given agent behavior to this demonstration.

Deployed systems will likely use both factored generation and evaluation, or different mechanisms altogether that combine the best of both worlds, so we expect to test a combination of both in the future.

Our current experimental setup

Our factored evaluation experiments study how a weak human judge can supervise experts who know more and have more time to think. We’re starting with reading comprehension questions where we create an artificial gap between a judge (who doesn’t get to read a text) and experts (who do).

Skipping over a few details, our experiments have the following structure:

  • There is a person, the judge.
  • The judge faces an overall (root) question: “What does the author of this Pitchfork music album review think of the work being reviewed?”
  • This judge is handicapped: they can read at most 650 characters, so they can never read the whole review. Thus, the judge does not have the context required to answer this root question.
  • However, the judge has access to two experts who can read the whole text and who provide two possible answers.
  • Unfortunately, only one of these experts is honest; the other is malicious and is trying to trick the judge into accepting a wrong but plausible-sounding answer.
  • Without ever seeing the whole text, and only getting information through the experts, the judge must ask the experts follow-up questions to figure out which answer to the root question is honest, and select that one.
  • No one can lie about quotes or quotes’ positions in the text: the quotes from the text are the ground truth anchoring this game.
  • The judge can ask at most 6 questions in total before they must make a decision.

Whenever the judge asks the experts a question, this generates a new experiment: Now a different judge must decide which of two expert answers to that question is honest and which is malicious, using the same recursive process. For this to terminate, eventually a judge must choose an answer without asking any subquestions.
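To make the recursive structure concrete, here is a minimal Python sketch of one round of this game. It is an illustration of the protocol only, not the code behind our experiments; the judge policy and expert behavior are left as hypothetical callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_QUESTIONS = 6    # the judge can ask at most 6 questions per round
SNIPPET_LIMIT = 650  # the judge never sees more than 650 characters of text

@dataclass
class Round:
    question: str
    answer_a: str  # one of these two answers is honest, the other malicious;
    answer_b: str  # the judge doesn't know which is which

# A judge policy sees the round, the transcript of resolved follow-ups, and a
# flag saying whether it must decide now; it returns a follow-up question or "a"/"b".
JudgePolicy = Callable[[Round, List[Tuple[str, str]], bool], str]
# Experts map a follow-up question to two candidate answers.
Experts = Callable[[str], Tuple[str, str]]

def run_round(rnd: Round, judge: JudgePolicy, experts: Experts) -> str:
    """Play one round; each follow-up question spawns a new round of the same game."""
    transcript: List[Tuple[str, str]] = []
    while True:
        must_decide = len(transcript) >= MAX_QUESTIONS
        move = judge(rnd, transcript, must_decide)
        if must_decide or move in ("a", "b"):
            return move  # the judge commits to one of the two answers
        ans_a, ans_b = experts(move)
        sub = Round(question=move, answer_a=ans_a, answer_b=ans_b)
        # A (conceptually fresh) judge resolves the subquestion recursively.
        transcript.append((move, run_round(sub, judge, experts)))
```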

What we learned about factored evaluation

Ultimately, we want to show that this approach to evaluating experts’ answers allows the judge to identify the correct answer even when they have limited access to information. To do this, we have to show that the judge selects the honest answer consistently, even when the malicious expert is very strong or has a lot of advantages.

We haven’t yet shown that factored evaluation works reliably in this reading comprehension context. Over the past few months of testing factored evaluation, we’ve learned that:

Good strategies don’t emerge organically

Our factored evaluation experiments were originally structured like a game. Our initial hypothesis was that if we could establish the rules of a game and provide participants with feedback, then strategies for what questions to ask in what context would emerge organically as more people tried to win. While we’ve seen some interesting norms and tactics emerge, we’ve concluded that this is not the most effective way to develop strategies because:

  1. Many of the decomposition strategies that participants tried were overfit to the specific rules and the Pitchfork reading comprehension context (understandably so). But Ought cares about strategies that show promise of generalizing to multiple contexts, especially to evaluating answers to open-ended questions.

  2. It's difficult to come up with good strategies, for both participants and Ought employees. We’ve updated towards thinking that a few people need to spend concentrated time working on this, not intermittent time here and there. Additionally, generating promising strategies benefits from a background in probability and logic.

Some of the strategies we've seen so far (designed for internal consumption):

  1. Claim Validation v0.3 / v0.2 (Erol Can Akbaba)

    Idea: Split each of the expert answers into distinct claims, have the experts provide evidence for each claim and arguments about the relative goodness of evidence.

  2. Malicious Credences (Liam Donovan)

    Idea: Ask the experts to state probabilities on which answer is better, and which aspects of each answer are likely to be misleading; ask them to provide evidence that will most help the judge update their beliefs.

  3. What procedure should I use? (Paul Christiano, William Saunders)

    Idea: Always ask the experts "What procedure should I use to decide which of the two answers is better?" and then implement that procedure.

  4. Factored Justification (William Saunders) / Factored Explanation (Mati Roy)

    Idea: Ask the experts to produce a complete hierarchical argument in a standardized format for each of the answers upfront, then check the two arguments.

While none of these strategies seem sufficient on their own, we think that they all contain seeds for promising future strategies.

Work needs to be ensembled to reduce error rates

No matter how skilled or well-trained a participant is, they will still make mistakes. They might sometimes misinterpret the question or the text and accidentally return or select an incorrect answer to the question they’re working on. If we want to eventually deploy our question-answering systems in high-stakes situations, we can’t afford these mistakes. We’ve realized that we need to build systems and processes that aggregate answers across participants, so that the malicious expert never wins simply because someone made an avoidable mistake. As a first step, we're looking into ensembling: we give the same subtask to multiple people and then use a voting process to determine which of their answers to go with.
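As a first illustration of the idea (a minimal sketch, not our production tooling), the simplest version of ensembling is a majority vote over independent answers to the same subtask:

```python
from collections import Counter

def ensembled_answer(answers):
    """Majority vote over several participants' answers to the same subtask.

    If one participant slips, the aggregate can still be correct as long as
    most participants agree. Ties are broken arbitrarily in this sketch.
    """
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Three participants independently evaluate the same subquestion:
# ensembled_answer(["a", "a", "b"]) returns "a".
```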

Trees get large in factored evaluation, too

By default, trees don't just get excessively large in factored generation, but in factored evaluation as well. Here is an extreme example of a tree that went on for over 2 months without completing, i.e. without an answer being selected at the root level. In this tree, 42 subquestions were asked, with 195 assignments worked on by 22 different participants.

Trees need to finish for us to validate whether a strategy reliably leads to the right answer being selected. To resolve this issue, we rolled out a version of the game we call Most Interesting Branch. In Most Interesting Branch, the honest and malicious expert agree to explore just one path down the question-answer tree, and to yield to the other’s answer for the rest of the tree.

Priorities

Over the next few months, we’ll focus on developing strategies in-house and testing factored evaluation more modularly. Instead of trying to get participants to come up with promising strategies by playing and improving at a game, Ought employees will devise strategies that we think should consistently select the right answers and generalize beyond the reading comprehension context. We’ll then test strategy execution in a more incremental fashion, starting with single-layer trees and producing robust guarantees at each step along the way.

We value getting feedback on different approaches to experimentation. We’ve assembled an experiment review board of 10 academics, including professors at Stanford, UCSD, Berkeley, Harvard, ANU, and Wharton. We trust their judgement on experimentation and think that such a board will help us run experiments more rigorously as well as broaden the reach of our research. If you have thoughts on how we can run better experiments, reach out to us at experiments@ought.org.

Machine learning projects

Today, machine learning systems are not advanced enough to do open-ended reasoning, so we're primarily running experiments with human participants. Longer-term, we’ll automate the work of participants in the experiments described above, such that the decompositions, expert answers, and answer evaluations are all produced by machine learning systems.

To ensure that our research with human participants doesn’t deviate too far from what is needed in the future to work with ML, and to better estimate when all of this work can be automated, we ran the following projects:

Automating decompositions in narrow domains

Complex Web Questions

First, we took the Complex Web Questions dataset, which contains questions like this:

  • The actress that had the role of Martha Alston, plays what role in Finding Nemo?
  • Which school that Sir Ernest Rutherford attended has the latest founding date?
  • What movies does Leo Howard play in and that is 113.0 minutes long?
  • Where is the end of the river that originates in Shannon Pot?

We built an end-to-end system using GPT-2 that breaks the questions into subquestions, queries Google to answer each of the subquestions, and aggregates the answers back together to answer the original question. Currently, our system answers about 30% of the questions in CWQ correctly.
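At a high level, the pipeline has the shape sketched below. The helper functions are illustrative stand-ins rather than our actual code: in the real system, the fine-tuned GPT-2 model proposes the subquestions and a Google query answers each one.

```python
from typing import List

def decompose(question: str) -> List[str]:
    """Stand-in for the fine-tuned GPT-2 model that proposes subquestions."""
    return [question]  # trivial fallback: treat the question as its own subquestion

def answer_with_search(subquestion: str) -> str:
    """Stand-in for issuing a web search query and extracting an answer."""
    return f"<answer to: {subquestion}>"

def aggregate(question: str, subquestions: List[str], subanswers: List[str]) -> str:
    """Stand-in for combining subanswers into an answer to the original question."""
    return subanswers[-1]

def answer_complex_question(question: str) -> str:
    subquestions = decompose(question)
    subanswers = [answer_with_search(sq) for sq in subquestions]
    return aggregate(question, subquestions, subanswers)
```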

Numerical estimation

We also started compiling our own dataset of numerical estimation questions, such as:

  • How many cells are in an adult Paedophryne amauensis frog?
  • If plants stopped "making" O2, how much time of breathable air do humans have?
  • How much do all the world’s beards weigh?

We learned that this dataset needs to be highly structured for GPT-2 to learn how to break down the initial question into subquestions based on human demonstrations. Currently, our data format looks like this:

Question: How many cells are in an adult Paedophryne amauensis frog?
Formalization: number cells in an adult Paedophryne amauensis frog
A1: volume adult Paedophryne amauensis frog
A2: volume cell in adult Paedophryne amauensis frog
Aggregation: A1 / A2
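To picture what downstream code does with a record in this format, here is a small hypothetical sketch (the field names mirror the listing above, but the dataclass and helper are illustrative, not part of our pipeline):

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Decomposition:
    question: str
    formalization: str
    subquestions: Dict[str, str]  # e.g. {"A1": "volume adult ... frog", "A2": "volume cell ..."}
    aggregation: str              # e.g. "A1 / A2"

def evaluate(decomp: Decomposition, estimates: Dict[str, float]) -> float:
    """Evaluate the simple 'A1 / A2' aggregation pattern shown above."""
    numerator, _, denominator = decomp.aggregation.partition(" / ")
    return estimates[numerator] / estimates[denominator]

# evaluate(frog_decomposition, {"A1": 2e-6, "A2": 1e-14})
# The numeric estimates here are placeholders for illustration only.
```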

In this dataset, our current ML predictions match human decomposition steps 15% of the time on our validation set.

For both numerical estimation and Complex Web Questions, we view these results as initial (weak) evidence that fine-tuning general-purpose language models on decomposition tasks might be promising. To better understand how true this is, in future work we'd like to study cases where getting decompositions right requires world knowledge that the model has learned in its unsupervised pretraining phase.

Estimating how model performance scales with dataset size

Over time we'd like to estimate quantitatively how much data we need to automate the work our participants do. As a first step in that direction, we explored the effects studied in the Hestness et al. (2017) paper “Deep learning scaling is predictable, empirically”. Hestness et al. showed that, across a number of domains including image classification, language modeling, and speech recognition, there is a power-law relationship between dataset size and validation loss: to halve validation loss, you need roughly k times more data, for some constant k that depends on the task.

We replicated their results using transformer models on small decomposition tasks (Complex Web Questions, numerical estimation, math word problems). Calculating k for numerical estimation tasks of different kinds based on small-scale initial data collection helped us converge on the structured data format above. If you're training language models and are deciding what kind of data to collect, you might want to run a similar exercise to estimate ahead of time how much data you'd need to achieve a particular validation loss.
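If you do run such an exercise, a minimal version of the fit looks like the sketch below; it assumes a pure power law and ignores the irreducible-loss term that a fuller analysis would include.

```python
import numpy as np

def estimate_scaling(dataset_sizes, val_losses):
    """Fit validation loss ~ a * n**(-alpha) and return (alpha, k).

    k is the factor by which the dataset must grow to halve validation loss:
    since loss scales as n**(-alpha), k = 2 ** (1 / alpha).
    """
    slope, _intercept = np.polyfit(np.log(dataset_sizes), np.log(val_losses), 1)
    alpha = -slope
    return alpha, 2.0 ** (1.0 / alpha)

# Example: alpha = 0.5 implies k = 4, i.e. quadruple the data to halve the loss.
```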

Software engineering

Ought’s engineering team owns Mosaic, the web app we use to gather data from our experiment participants. Mosaic was initially built around the factored generation experiments we mentioned earlier, which makes it suboptimal for our current experiments. We’re excited about running many different types of question-answering experiments in the future, so we’ve started working on Mosaic2.

Mosaic2 is a more flexible web app that simplifies setting up varied experiment mechanisms. In Mosaic2, teams can specify the types of interactions they want to have with experiment participants. Without building separate apps, they can easily run factored evaluation, factored generation, or debate experiments.
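To give a flavor of what specifying interactions could look like, here is a purely hypothetical sketch (not Mosaic2's actual interface) in the spirit of describing an experiment as a set of states and transitions:

```python
# Hypothetical example only; Mosaic2's real specification format may look
# nothing like this. The idea is that an experiment mechanism is just data:
# workspace states plus the participant actions that move between them.
FACTORED_EVALUATION = {
    "states": ["awaiting_judge", "awaiting_experts", "decided"],
    "actions": {
        "ask_subquestion": ("awaiting_judge", "awaiting_experts"),
        "provide_answers": ("awaiting_experts", "awaiting_judge"),
        "select_answer": ("awaiting_judge", "decided"),
    },
}
```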

Mosaic2 is still under development, but we’re excited to launch it soon! If the idea of building an app that structures and aggregates the thinking of a large crowd of participants excites you, check out this opportunity to lead the team building it.

Organization updates

Team

  • William Saunders joined us in May as a Research Intern and decided to extend his time with us through the next year. William is leading the machine learning projects described above.
  • Jungwon Byun joined as COO in June. She runs finance, legal, recruiting, and experiment operations.
  • Our biggest hiring priorities are an Engineering Team Lead and someone to join our Operations team.

Collaborators and contractors

The following contractors and collaborators also contributed to our work:

  • Zachary Miller and Andrew Schreiber are supporting Mosaic 1 and 2. They're building features like optimized scheduling algorithms for matching participants to cognitive work, and have tackled architectural challenges like how to concisely specify experiments as state transition functions.
  • Mati Roy is helping manage our contractor network and experiment logistics.
  • Milan Griffes and Ben Goldhaber have helped support various projects on the business operations side, including recruiting, payroll, and visa processing.

Special thanks to:

  • Jeff Wu for helping us finetune GPT-2 for our purposes
  • Beth Barnes and Long Ouyang for providing regular feedback on our experiments
  • Our amazing participants, who are so flexible about testing different rules and approaches, sneak Easter eggs of humor into their malicious answers, hold us accountable in our work, and support each other in tough times

New donors

Since December 2018, we’ve received generous donations from the following people and institutions:

  • Paul Christiano, who explains his decisions to donate in this post on LessWrong
  • Ben Delo, advised by Effective Giving
  • The Centre for Effective Altruism’s Long-term Future Fund
  • Nisan Stiennon, who “chose to support Ought because research into factored cognition is a promising way to attack the AI alignment problem”
  • The Future of Life Institute (the second disbursement of a previous award)

Content

  • Andreas gave a talk at EA Global in June, explaining the challenges of delegating open-ended question-answering to experts, whether humans or machines. He argues that questions without easily checkable answers are particularly difficult and yet particularly important, and demos Ought’s experiments designed to make progress on this challenge.
  • Also linked above, Paul Christiano wrote a post on LessWrong explaining why he thinks Ought is “one of the most promising projects working on AI alignment.”
  • William and Owain published a document describing three concrete machine learning projects that researchers interested in Iterated Distillation and Amplification (IDA) might be interested in tackling:
    • Applying IDA to math problems, and solving them by breaking them down into easier math problems
    • Applying IDA to Neural Program Interpretation
    • Using adaptive computation techniques to decide when to rely on a fast distilled model vs. run a more expensive decomposition

How you can help

If you’d like to help with our work, you can:

  • Refer candidates for our Engineering Team Lead role. We’re currently offering a $5,000 referral bonus to the person who introduces us to the right candidate.
  • Introduce us to candidates interested in joining our Operations team. Here’s a letter explaining the role and the type of person we hope to work with.
  • Donate - we’re a 501(c)(3) non-profit and none of the activities above could happen without funders like you!

For more updates like this one, sign up for our newsletter.

Thanks!

This post was published on October 28, 2019 by Jungwon Byun and Andreas Stuhlmüller.