How to Write Perfect AI Summaries of Giant Pieces of Text

A few months ago, I built a boilerplate codebase for building apps with AI.

Since then, I’ve worked on expanding this codebase to be more powerful and modular, giving me a good opportunity to observe the challenges of working with AI firsthand.

During my exploration, I encountered a seemingly simple and exceedingly common challenge: reading files. Since hundreds of apps that let you “chat” with your files already exist, I thought this would be light work. Instead, I encountered a fun challenge that spurred a creative solution.

My Stack

  • Express API written in TypeScript and compiled with Vite

  • OpenAI API using Langchain JS/TS interface

  • Not important for this but maybe worth mentioning

    • Vercel for local and prod app development

    • Auth0 for login

    • MongoDB Atlas for Database

    • Svelte static site frontend

What Happened

This story starts with me workshopping an idea for an AI-powered education app. The app included a module for uploading and summarizing course material, which is our starting point.

Token Limits

Every seasoned ChatGPT user will eventually run into a frustrating error which suggests that the text they have provided to the app exceeds the model’s token limit.

While this might not come up in day-to-day usage, it becomes an obvious challenge when working with large bodies of text. In education, most readings and textbooks easily surpass this token limit, so most course materials are hard to process with AI.

Token limits not only exist to improve model efficiency, but also to improve accuracy. You might notice how the quality of an AI output will dwindle as it gets longer. Since Large Language Models are just advanced next-word predictors, the farther the output trails from the original input prompt, the more likely it is to become convoluted.
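For back-of-the-envelope planning, you can estimate a text's token count before sending it to a model. The four-characters-per-token ratio below is a rough heuristic for English text, not the model's actual tokenizer (exact counts require a tokenizer library such as tiktoken):

```typescript
// Rough token estimate: ~4 characters per token for English text.
// This is a heuristic only; real counts come from the model's tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// A 50,000-character chunk works out to roughly 12,500 tokens.
const approxTokens = estimateTokens("a".repeat(50000));
```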

Breaking down and processing the text in smaller chunks seems like the obvious solution, but it also comes with challenges.

Text Chunking and Chunk Overlapping

The easiest way to break down text is to split it into some number of token-sized “chunks.” Since a token corresponds, on average, to a fixed number of characters, you can approximate token-sized chunks by splitting large bodies of text at a predetermined character length.

const CHUNK_SIZE = 50000; // characters, not tokens
const blocks: string[] = [];
for (let i = 0; i < text.length; i += CHUNK_SIZE) {
    blocks.push(text.substring(i, i + CHUNK_SIZE));
}

Once the text is broken down into chunks, each of those chunks can be independently managed. However, when text is broken down into arbitrary pieces based on word count, it can be divided at awkward breakpoints. For instance, text may be split in the middle of a word, sentence, paragraph, or section, leading to a loss of important context.
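To make the problem concrete, here is a small, hypothetical example of a naive fixed-size split cutting a sentence mid-word:

```typescript
// Naive fixed-size splitting cuts text wherever the counter lands,
// even in the middle of a word.
const sample = "Photosynthesis converts light energy into chemical energy.";
const size = 25;
const naiveChunks: string[] = [];
for (let i = 0; i < sample.length; i += size) {
  naiveChunks.push(sample.substring(i, i + size));
}
// The first chunk ends mid-word: "Photosynthesis converts l"
```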

“Chunk overlapping” is a common solution to this issue. Instead of breaking text along clean lines, this technique involves splitting text in a way where each section shares start/end overlap with the prior and following sections. This way, any missing context can be gleaned from that overlap. I used Langchain’s “Text Splitter” functionality to accomplish this.

const splitter = new CharacterTextSplitter({
  separator: '\n',
  chunkSize: 50000,
  chunkOverlap: 1000,
  lengthFunction: (text) => text.length
});
const chunks = await splitter.splitText(myText);
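Under the hood, the idea is simple: each chunk starts `chunkSize - chunkOverlap` characters after the previous one, so consecutive chunks share a window of text. A minimal character-based sketch of the same idea, without LangChain's separator handling (assuming `overlap < chunkSize`):

```typescript
// Minimal overlapping splitter: consecutive chunks share `overlap`
// characters. Assumes overlap < chunkSize.
function splitWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.substring(i, i + chunkSize));
    if (i + chunkSize >= text.length) break;
  }
  return chunks;
}

// Each chunk repeats the last two characters of the one before it.
const demo = splitWithOverlap("abcdefghij", 4, 2);
// → ["abcd", "cdef", "efgh", "ghij"]
```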

When Chunking Isn’t Enough…

Chunking and chunk overlapping allowed me to process large bodies of text with large language models. However, my original goal was not just to process text, but to get a high-quality summary of it. Especially with educational material, I found that the breakpoints introduced by arbitrary, length-based chunking were detracting from overall performance.

Since my end goal was to automate the process of creating a study guide, I wanted each portion of that study guide to logically correlate to a well-delineated piece of course material. Fortunately, most academic or technical information is already pre-divided into chapters, sections, and paragraphs. Typically, exams are also drawn up on the same lines, so a good study guide will follow these divisions.

So, instead of chunking text into arbitrary lengths, I wanted my app to find these pre-drawn lines and use them to create logical sections in its output. If I provided self-contained sections, chapters, and paragraphs to the AI, then the AI would be forced to work with context-rich, independent blocks of text. The outputs for these blocks could then be used to synthesize a more optimal total summary of the text.

Discerning The Natural Sections of Text

After some experimentation, I came up with a clever solution that did exactly what I needed. The basic principle went as follows:

  1. Break up a piece of text using regular chunk overlapping.

  2. Have the AI divide each chunk into its natural subsections, and summarize those subsections in detail.

  3. Join the chunk summaries into one large list of section summaries. Then, use AI to combine “cutoff sections” (where text splitting broke up a section unnaturally) and eliminate duplicate sections (those which occur completely within two overlaps).

In the end, I got a highly complete and far more accurate study guide that covers each logical section of the material in detail.

const chunkSummarySchema = {
  title: 'Chunk Summary',
  description: `Write a summary of the text chunk. [Additional Refinement Instructions]`,
  type: 'object',
  properties: {
    sections: {
      title: 'Sections',
      type: 'array',
      description: `A list of the logical sections in the text. The text should be broken up into chunks at these breakpoints.`,
      items: {
        type: 'object',
        properties: {
          position: {
            title: 'Position',
            type: 'integer',
            description: `The position of the breakpoint in the text.`
          },
          text: {
            title: 'Text',
            type: 'string',
            description: `A detailed summary of the section.`
          }
        }
      }
    }
  }
};

const chunkRunnable = createStructuredOutputRunnable({
  outputSchema: chunkSummarySchema,
  llm: chatModel,
  prompt: ChatPromptTemplate.fromTemplate(
    `Summarize the contents of the text: {chunk}`
  ),
  outputParser
});

const sections: string[] = [];
for (const chunk of chunks) {
  // Call the OpenAI API once per chunk
  const result = await chunkRunnable.invoke({ chunk });
  const chunkSections = JSON.parse(result).sections;
  for (const chunkSection of chunkSections) {
    sections.push(chunkSection.text);
  }
}

const unfilteredText = sections.join("\n-----\n")

const filteredTextSchema = {
  title: 'Filtered Summary',
  description: `Condense the list of section summaries. [Additional Refinement Instructions]`,
  type: 'object',
  properties: {
    summary: {
      title: 'Summary',
      type: 'string',
      description: 'A condensed summary of the text sections. Combine sections which cover the exact same topic but seem to be missing context. Eliminate duplicate sections. Keep the original format of section divisions.'
    }
  }
};

const filteredTextRunnable = createStructuredOutputRunnable({
  outputSchema: filteredTextSchema,
  llm: chatModel,
  prompt: ChatPromptTemplate.fromTemplate(
    `Summarize the contents of the unfiltered text: {unfilteredText}`
  ),
  outputParser
});

const result = await filteredTextRunnable.invoke({
  unfilteredText
});
// Process result
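Assuming the final runnable returns a JSON string shaped like `filteredTextSchema` (the same assumption made when parsing the per-chunk results above), extracting the finished study guide is a single field access. `sampleResult` below is a stand-in for the real API response:

```typescript
// Stand-in for the JSON string returned by filteredTextRunnable.invoke
const sampleResult = JSON.stringify({
  summary: "Section 1: Cell structure...\n-----\nSection 2: Photosynthesis..."
});

// Pull the study guide out of the structured result
const { summary } = JSON.parse(sampleResult) as { summary: string };
```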