Extracting Structured Data from PDF Files using OpenAI & Langchain

 




Aim


The aim of this project is to create a paper parser that extracts questions from an exam file and converts them into JSON format, leveraging Langchain and OpenAI.


Technologies Used


The code is written in JavaScript and utilizes the following technologies:


  • dotenv: A module for loading environment variables from a .env file.

  • zod: A TypeScript-first schema validation library.

  • Langchain: A framework designed to simplify the creation of applications using large language models.

  • fs: A built-in Node.js module for working with the file system.

Key Challenges


    Here are the key challenges we faced and some tips to overcome them:


    1. Writing the Prompt:


    One challenge was writing a prompt that guides the OpenAI language model to extract the desired information accurately without making up questions or generating irrelevant responses. To address this challenge, here are some tips:


  • Clearly specify the format and structure of the expected output.

  • Include explicit instructions on how to handle different scenarios (in our case, such as when options are present or absent, or when explanations are empty).

  • Provide examples and clear guidelines to illustrate the desired output format.

    2. Splitting Questions Based on a Regex:


    Another challenge was to correctly split the text into individual questions based on a regex pattern. In this code, the regex pattern questionPattern is used to identify and extract questions from the PDF text. To overcome this challenge:


  • Ensure the regex pattern accurately captures the desired structure.

  • Test the pattern against different question formats to ensure it matches correctly.

  • Consider edge cases, such as variations in question numbering or formatting, and adjust the regex pattern accordingly.

  • Validate and handle cases where the regex pattern may fail to match or extract the desired questions.
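For instance, a small harness makes it quick to check a candidate pattern against the tag styles you expect to encounter. The pattern and sample strings below are illustrative, not the exact ones used later in the implementation:

```javascript
// A candidate question-tag pattern: the word "Question" (upper- or
// lower-case Q), an optional colon, then the question number.
const questionPattern = /\b[Qq]uestion:?\s+(\d+)\b/g;

// Sample strings covering the formatting variations we want to handle.
const samples = [
  "Question 1 What does fs do?",
  "Question: 2 Which library validates schemas?",
  "question 3 Where is the API key stored?",
];

for (const s of samples) {
  const found = [...s.matchAll(questionPattern)];
  console.log(found.length === 1 ? `matched #${found[0][1]}` : `MISSED: ${s}`);
}
```

Running a harness like this against real excerpts from your PDFs is the fastest way to catch numbering or punctuation variations before they silently drop questions.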

    3. Writing a Perfect Schema:


    The code utilizes the Zod library to define a schema for structured output. The schema describes the expected structure and validation rules for the extracted questions. Writing a perfect schema involves:


  • Carefully defining the schema based on the expected properties, types, and constraints of the extracted questions.

  • Considering all possible variations in the data.

  • Validating and handling edge cases, such as missing fields, optional fields, or unexpected input.

  • Regularly testing the schema with different scenarios and inputs to ensure it captures all possible cases and provides accurate validation.
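Zod's safeParse is useful for the last two points: it returns a result object instead of throwing, so one malformed question cannot abort the whole run. The same defensive shape can be sketched in dependency-free JavaScript — validateQuestion below is a hypothetical stand-in for calling questionSchema.safeParse, mirroring a few of the rules from our schema:

```javascript
// Stand-in for schema validation of one extracted question.
// With Zod this would be questionSchema.safeParse(value).
function validateQuestion(value) {
  const errors = [];
  // "no" must be a positive integer, like z.number().int().positive().
  if (!Number.isInteger(value.no) || value.no <= 0) {
    errors.push("no must be a positive integer");
  }
  // "question" must be a non-empty string, like z.string().nonempty().
  if (typeof value.question !== "string" || value.question.length === 0) {
    errors.push("question must be a non-empty string");
  }
  // "explanation" is optional, but must be a string when present.
  if (value.explanation !== undefined && typeof value.explanation !== "string") {
    errors.push("explanation, when present, must be a string");
  }
  return errors.length === 0
    ? { success: true, data: value }
    : { success: false, errors };
}

console.log(validateQuestion({ no: 1, question: "What is JSON?" }).success); // true
console.log(validateQuestion({ no: -1, question: "" }).success);             // false
```

Collecting every violation instead of failing on the first makes it much easier to see which schema rule a model response broke.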

Our Approach


    The overall approach to extract questions from a PDF file in JSON format can be summarized as follows:


  • Load the PDF document.

  • Extract the text content from each page and concatenate it.

  • Find all matches of the question pattern in the concatenated text.

  • Iterate through the matches and extract the questions along with their respective question numbers.

  • Create a parser for structured output using Zod schema to validate and parse the extracted information.

  • Create a prompt template with instructions and input variables for generating responses from the OpenAI language model.

  • Generate responses from the model for each extracted question.

  • Parse the responses using the output parser and store the extracted output in an array.

  • Append the extracted output to a JSON output file.

Implementation


    Before getting started, you’ll need an API key from OpenAI. You can find your secret API key in your user settings.


    Step 1: Import Required Modules and Libraries




    In this step, we import the necessary modules and libraries required for our code. These include modules for environment variable management, schema validation, interacting with OpenAI language models, generating prompts, parsing structured outputs, loading PDF documents, and working with the file system.
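The original post does not show the import block itself. A plausible version, assuming a mid-2023 Langchain JS release (module paths have moved between Langchain releases, so verify them against your installed version), looks like this:

```javascript
import * as dotenv from "dotenv";
import { z } from "zod";
import { OpenAI } from "langchain/llms/openai";
import { PromptTemplate } from "langchain/prompts";
import { StructuredOutputParser } from "langchain/output_parsers";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import fs from "fs";

// Load OPENAI_API_KEY and other settings from a .env file.
dotenv.config();
```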


    Step 2: Create a PDF Loader Instance


    const inputFilePath = `Exam_Docs/${process.argv[2]}`;

    const loader = new PDFLoader(inputFilePath);


    We set the input file path based on the command-line argument provided when running the script (the argument specifies the PDF file to process) and create an instance of the PDFLoader class, which will be used to load the PDF document and extract its text content.


    Step 3: Define Question Pattern


    const questionPattern = /\b[Qq][Uu][Ee][Ss][Tt][Ii][Oo][Nn]:?\s+(\d+)\b/g;


    Here, we define a regular expression pattern that matches the word 'Question' (in any letter case), an optional colon, and a number. This pattern will be used to identify and extract the questions from the PDF text.


    Step 4: Load the PDF Document


    const doc = await loader.load();


    We use the PDFLoader instance to load the PDF document specified by the input file path. The loader.load() method returns a Promise, so we use the await keyword to asynchronously wait for the document to be loaded.


    Step 5: Extract Text Content from Pages


    let allText = "";
    for (let i = 0; i < doc.length; i++) {
      allText += doc[i].pageContent;
    }


    In this step, we iterate through each page of the loaded PDF document and extract its text content. The extracted text is concatenated and stored in the allText variable.
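One caveat worth noting: concatenating pages directly can glue the last word of one page to the first word of the next. Inserting a separator between pages avoids that; a self-contained sketch (the two-page document below is illustrative):

```javascript
// Illustrative two-page stand-in for the loaded document.
const doc = [
  { pageContent: "last line of page one" },
  { pageContent: "first line of page two" },
];

let allText = "";
for (let i = 0; i < doc.length; i++) {
  // A newline between pages keeps the final word of one page
  // from fusing with the first word of the next.
  allText += doc[i].pageContent + "\n";
}
console.log(JSON.stringify(allText));
```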


    Step 6: Find Question Matches in Text


    const matches = [...allText.matchAll(questionPattern)];
    let extractedQuestions = [];
    for (let i = 0; i < matches.length; i++) {
      const match = matches[i];
      const questionNumber = match[1];
      const startIndex = match.index + match[0].length;
      const endIndex = i === matches.length - 1 ? undefined : matches[i + 1].index;
      const questionText = allText.substring(startIndex, endIndex).trim().replace(/\s+/g, " ");
      const fullQuestion = `Question ${questionNumber}: ${questionText}`;
      extractedQuestions.push(fullQuestion);
    }


    Here, we use the matchAll() method on the allText string to find all matches of the question pattern. We iterate through the matches and extract the question number, start index, and end index for each match. We then extract the question text using the start and end indices, trim any extra whitespace, and store the full question (including the question number) in the extractedQuestions array.
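The index arithmetic can be exercised as a self-contained miniature; the simplified pattern and two-question string below are illustrative:

```javascript
// Simplified question-tag pattern for this demonstration.
const questionPattern = /\b[Qq]uestion:?\s+(\d+)\b/g;
const allText = "Question 1 What is JSON? Question 2 What is fs?";

const matches = [...allText.matchAll(questionPattern)];
const extractedQuestions = [];
for (let i = 0; i < matches.length; i++) {
  const match = matches[i];
  const questionNumber = match[1];
  // A question's text runs from the end of its tag to the start of
  // the next tag, or to the end of the string for the last question.
  const startIndex = match.index + match[0].length;
  const endIndex = i === matches.length - 1 ? undefined : matches[i + 1].index;
  const questionText = allText.substring(startIndex, endIndex).trim().replace(/\s+/g, " ");
  extractedQuestions.push(`Question ${questionNumber}: ${questionText}`);
}
console.log(extractedQuestions);
// → ["Question 1: What is JSON?", "Question 2: What is fs?"]
```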


    Step 7: Create a Parser for Structured Output


    const parser = StructuredOutputParser.fromZodSchema(
      z
        .array(
          z.object({
            no: z.number().int().positive().describe("Question number"),
            question: z.string().nonempty().describe("Question"),
            options: z
              .object({
                A: z.string().nonempty().describe("Option A"),
                B: z.string().nonempty().describe("Option B"),
                C: z.string().optional().describe("Option C"),
                D: z.string().optional().describe("Option D"),
                E: z.string().optional().describe("Option E"),
                F: z.string().optional().describe("Option F"),
                G: z.string().optional().describe("Option G"),
                H: z.string().optional().describe("Option H"),
              })
              .strict()
              .describe("Options"),
            ans: z
              .array(z.enum(["A", "B", "C", "D", "E", "F"]))
              .min(1)
              .describe("Answers"),
            explanation: z.string().optional().describe("Explanation"),
            result: z.string().nonempty().describe("result"),
          })
        )
        .nonempty()
        .describe("Array of exam questions")
    );


    In this step, we create a parser for structured output using the Zod schema library. The schema defines the structure and validation rules for the extracted exam questions. The parser will be used to parse and validate the responses generated by the OpenAI language model.


    Step 8: Get Format Instructions for the Parser


    const formatInstructions = parser.getFormatInstructions();


    We retrieve the format instructions for the parser, which provides guidance on how to structure the input and output data for the parser to work correctly. These instructions will be used in the prompt template.


    Step 9 : Create a Prompt Template


    const prompt = new PromptTemplate({
      template:
        "Extract information as it is from text.\n{format_instructions}\nExtract the 'options' text from before the 'explanation' text section, and check whether options are explicitly written in the prompt in the format 'A. option_a_text', 'B. option_b_text', 'C. option_c_text', etc. If they are not, set the 'options' field as an empty object and provide the 'result' field as 'failed'. If options are present in the correct format in the prompt, extract the correct 'ans' field with valid options (a, b, c, d, e) and provide the 'result' field as 'success'.\nHere is an example to understand what to do in case options are not explicitly available in the question description: \nExample question description: '1. Which cities are present in India? Explanation: Delhi because Berlin is in Germany.'\n Output : ['no' : 1, 'question': 'Which cities are present in India', 'options' : [], ans : , 'explanation' : 'Delhi because Berlin is in Germany', 'result' : 'failed']\n\nIf the explanation is empty in the question description, set the 'explanation' field as an empty string in the response. For example, if the question description is '1. What is the capital of France? A. Paris B. Berlin Answer: A Explanation:' \n Output : ['no' : 1, 'question': 'What is the capital of France?', 'options' : ['A' : 'Paris', 'B' : 'Berlin'], ans : (A), 'explanation' : '', 'result' : 'success']\nPlease output the extracted information in JSON codeblock format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. All output must be in markdown JSON codeblock and follow the schema specified above.\nQuestion description: {inputText} ",
      inputVariables: ["inputText"],
      partialVariables: { format_instructions: formatInstructions },
    });


    We create a prompt template using the PromptTemplate class. The template string contains instructions and variables that will be filled in dynamically. In this case, the inputText variable will be replaced with the extracted questions, and the format_instructions variable will be replaced with the retrieved format instructions. Remember to add examples to the prompt so that the model understands the instructions better.


    Step 10: Create an OpenAI Model Instance


    const model = new OpenAI({
      temperature: 0,
      modelName: "gpt-3.5-turbo-0613",
      maxTokens: 2000,
    });


    We create an instance of the OpenAI language model using the OpenAI class. The instance is configured with the desired settings, such as the temperature (0 for deterministic output), the model version, and the maximum number of tokens in a response.


    Step 11: Generate Responses for Each Question


    let extractedOutput = [];
    let totalRounds = extractedQuestions.length;
    let currentQuestionNumber = 0;
    for (let i = 0; i < totalRounds; i++) {
      currentQuestionNumber = i + 1;
      const input = await prompt.format({
        inputText: extractedQuestions[i],
      });
      // Code for generating and parsing the response
    }


    In this step, we iterate through each extracted question and generate responses for them using the OpenAI language model. We format the input by replacing the inputText variable in the prompt template with the current question. The generated response will be parsed and processed in the subsequent steps.


    Step 12: Generate, Parse, and Process the Response

    for (let i = 0; i < totalRounds; i++) {
      currentQuestionNumber = i + 1;
      const input = await prompt.format({
        inputText: extractedQuestions[i],
      });
      let response;
      try {
        // Generate response from the model
        response = await model.call(input);
        try {
          // Parse the response using the output parser
          let output = await parser.parse(response);
          extractedOutput.push(output[0]);
        } catch (err) {
          let errObj = {
            no: currentQuestionNumber,
            result: "failed",
          };
          extractedOutput.push(errObj);
        }
      } catch (err) {
        let errObj = {
          no: currentQuestionNumber,
          result: "failed",
          remark: "failed to generate response",
        };
        extractedOutput.push(errObj);
      }
    }


    We generate a response from the OpenAI language model by calling the call() method on the model instance; the input parameter contains the formatted prompt for the current question. We then parse the response using the structured output parser created earlier. If parsing succeeds, the parsed output is stored in the extractedOutput array; if an error occurs during generation or parsing, an error object recording the failed question number is added instead.


    Step 13: Save Extracted Output to a File


    const outputFilePath = "extracted_output.json";
    fs.appendFile(outputFilePath, JSON.stringify(extractedOutput), (err) => {
      if (err) console.log(err);
    });


    Finally, we save the extracted output to a JSON file. The extractedOutput array is serialized to JSON format using JSON.stringify(), and then written to a file using the appendFile() method from the fs module.


    Conclusion


    In conclusion, the provided code demonstrates an approach to extracting questions from a PDF file and generating structured output in JSON format. By leveraging various technologies such as the OpenAI language model, Langchain, and the Zod library for schema validation, the code achieves the desired goal of extracting questions and relevant information from the PDF.


    By following the step-by-step breakdown of the code, addressing the challenges faced, and incorporating the provided tips, developers can gain insights into how to approach similar tasks involving information extraction from PDF files. With further customization and enhancements, this code can serve as a foundation for building more advanced question extraction systems or integrating it into larger applications.


    Overall, the code exemplifies the power of combining natural language processing models, pattern matching techniques, and structured output parsing to automate the extraction of relevant information from unstructured documents, opening up possibilities for various applications in fields such as education, data analysis, and information retrieval.
