Extracting Structured Data from PDF Files using OpenAI & Langchain
Aim
The aim of this project is to build a paper parser that extracts questions from an exam file and converts them into JSON, using LangChain and OpenAI.
Technologies Used:-
The code is written in JavaScript and uses LangChain, the OpenAI language model, and the Zod library for schema validation.
Key Challenges:-
Here are the key challenges we faced while writing the code, with some tips for overcoming them:
1. Writing the Prompt:
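One challenge was writing a prompt that guides the OpenAI language model to extract the desired information accurately, without making up questions or generating irrelevant responses. Two things helped: adding worked examples to the prompt so the model understands the instructions better, and stating explicitly that it must output only the extracted information, with no fields beyond those in the schema.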
2. Splitting Questions based on Regex:
Another challenge was correctly splitting the text into individual questions based on a regex pattern. In this code, the regex pattern questionPattern identifies the tag that introduces each question, and the text between two consecutive tags is treated as one question.
3. Writing a Perfect Schema:
The code uses the Zod library to define a schema for structured output. The schema describes the expected structure and validation rules for the extracted questions. Writing a good schema involves deciding which fields are required and which are optional, and giving each field a description.
Our Approach:
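The overall approach to extracting questions from a PDF file in JSON format can be summarized as follows: load the PDF and concatenate the text of its pages, split the text into individual questions with a regex, send each question to the OpenAI model with a prompt that embeds the parser's format instructions, parse and validate each response against the Zod schema, and write the results to a JSON file.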
Implementation:
Before getting started, you’ll need an API key from OpenAI. You can find your secret API key in your user settings.
Step 1: Import Required Modules and Libraries
In this step, we import the necessary modules and libraries required for our code. These include modules for environment variable management, schema validation, interacting with OpenAI language models, generating prompts, parsing structured outputs, loading PDF documents, and working with the file system.
Step 2: Create a PDF Loader Instance
const inputFilePath = `Exam_Docs/${process.argv[2]}`;
const loader = new PDFLoader(inputFilePath);
We define inputFilePath and create an instance of the PDFLoader class, which will be used to load the PDF document and extract its text content. The input file path is built from the command-line argument provided when running the script, which names the PDF file inside the Exam_Docs folder.
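When no file name is passed, process.argv[2] is undefined and the path silently becomes Exam_Docs/undefined. A minimal sketch of a guard follows; buildInputPath and the script name index.js are hypothetical and not part of the original code.

```javascript
// Hypothetical helper: build the input path from the CLI arguments,
// failing early when the file name is missing or is not a PDF.
function buildInputPath(argv, baseDir = "Exam_Docs") {
  const fileName = argv[2];
  if (!fileName) {
    throw new Error("Usage: node <script> <file.pdf>");
  }
  if (!fileName.toLowerCase().endsWith(".pdf")) {
    throw new Error(`Expected a .pdf file, got: ${fileName}`);
  }
  return `${baseDir}/${fileName}`;
}

// Example with a simulated argument vector (node, script, file name).
const examplePath = buildInputPath(["node", "index.js", "sample_exam.pdf"]);
console.log(examplePath); // Exam_Docs/sample_exam.pdf
```

In the real script the call would be buildInputPath(process.argv).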
Step 3: Define Question Pattern
const questionPattern = /\b[Qq][Ee][Tt][Oo]:?\s+(\d+)\b/g;
Here, we define a regular expression pattern that matches the question tag followed by a number. This pattern will be used to identify and extract the questions from the PDF text.
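To see what the pattern accepts, it can be run against a short string. The sample text below is made up for illustration. Since every letter is written as a two-case character class, the pattern behaves like a case-insensitive match on the tag.

```javascript
// The pattern from Step 3: a case-insensitive "QETO" tag, an optional colon,
// whitespace, and the question number captured in group 1.
const questionPattern = /\b[Qq][Ee][Tt][Oo]:?\s+(\d+)\b/g;

// Made-up sample text, used only for illustration.
const sample = "QETO 1 What is 2 + 2? qeto: 12 Name a prime. QETO3 is ignored.";

const found = [...sample.matchAll(questionPattern)].map((m) => ({
  tag: m[0],
  number: Number(m[1]),
  index: m.index,
}));

// Two matches: "QETO 1" and "qeto: 12". "QETO3" is skipped because the
// pattern requires whitespace between the tag and the number.
console.log(found);
```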
Step 4: Load the PDF Document
const doc = await loader.load();
We use the PDFLoader instance to load the PDF document. The file path was already passed to the constructor, so load() is called without arguments. It returns a Promise, so we use the await keyword to wait for the document to be loaded. The result is an array of documents, by default one per page.
Step 5: Extract Text Content from Pages
let allText = "";
for (let i = 0; i < doc.length; i++) {
  allText += doc[i].pageContent;
}
In this step, we iterate through each page of the loaded PDF document and extract its text content. The extracted text is concatenated and stored in the allText variable.
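One thing worth checking here is how the page texts meet at page boundaries. The question pattern starts with \b, so a tag that begins a page is only found if the previous page does not end in a letter or digit directly against it. Whether that happens depends on how the PDF text is extracted, so treat the following as a sketch with stand-in page objects rather than real loader output.

```javascript
const questionPattern = /\b[Qq][Ee][Tt][Oo]:?\s+(\d+)\b/g;

// Stand-in for the loader output: one object per page with a pageContent string.
const pages = [
  { pageContent: "QETO 1 First question text ends with answer D" },
  { pageContent: "QETO 2 Second question starts a new page" },
];

// Plain concatenation, equivalent to the += loop.
const joinedDirectly = pages.map((p) => p.pageContent).join("");
// Same text with a newline inserted between pages.
const joinedWithNewline = pages.map((p) => p.pageContent).join("\n");

const countTags = (text) => [...text.matchAll(questionPattern)].length;

console.log(countTags(joinedDirectly)); // 1 ("DQETO 2" has no word boundary before the tag)
console.log(countTags(joinedWithNewline)); // 2
```

Joining with a newline is harmless for the later steps, because the question text has its whitespace collapsed anyway.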
Step 6: Find Question Matches in Text
const matches = [...allText.matchAll(questionPattern)];
let extractedQuestions = [];
for (let i = 0; i < matches.length; i++) {
  const match = matches[i];
  const questionNumber = match[1];
  // The question text starts right after the matched tag...
  const startIndex = match.index + match[0].length;
  // ...and ends where the next tag begins (or at the end of the text).
  const endIndex = i === matches.length - 1 ? undefined : matches[i + 1].index;
  const questionText = allText
    .substring(startIndex, endIndex)
    .trim()
    .replace(/\s+/g, " ");
  const fullQuestion = `Question ${questionNumber}: ${questionText}`;
  extractedQuestions.push(fullQuestion);
}
Here, we use the matchAll() method on the allText string to find all matches of the question pattern. We iterate through the matches and extract the question number, start index, and end index for each match. We then extract the question text using the start and end indices, trim any extra whitespace, and store the full question (including the question number) in the extractedQuestions array.
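The same logic can be tried on a plain string without a PDF. In this sketch, splitQuestions is a hypothetical wrapper around the loop, and the sample text is invented.

```javascript
// Hypothetical helper wrapping the Step 6 loop so it can be tried on plain strings.
function splitQuestions(allText, pattern) {
  const matches = [...allText.matchAll(pattern)];
  return matches.map((match, i) => {
    const startIndex = match.index + match[0].length;
    // The question runs until the next tag, or to the end of the text.
    const endIndex = i === matches.length - 1 ? undefined : matches[i + 1].index;
    const questionText = allText
      .substring(startIndex, endIndex)
      .trim()
      .replace(/\s+/g, " ");
    return `Question ${match[1]}: ${questionText}`;
  });
}

const pattern = /\b[Qq][Ee][Tt][Oo]:?\s+(\d+)\b/g;
const text =
  "QETO 1 What is 2 + 2?\nA. 3  B. 4\nAnswer: B\n" +
  "QETO 2 Name a prime.\nA. 4 B. 7\nAnswer: B";

const questions = splitQuestions(text, pattern);
console.log(questions);
// [ 'Question 1: What is 2 + 2? A. 3 B. 4 Answer: B',
//   'Question 2: Name a prime. A. 4 B. 7 Answer: B' ]
```

Any text that appears before the first tag, such as an exam title, is dropped, because each slice starts after a matched tag.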
Step 7: Create a Parser for Structured Output
const parser = StructuredOutputParser.fromZodSchema(
  z
    .array(
      z.object({
        no: z.number().int().positive().describe("Question number"),
        question: z.string().nonempty().describe("Question"),
        options: z
          .object({
            A: z.string().nonempty().describe("Option A"),
            B: z.string().nonempty().describe("Option B"),
            C: z.string().optional().describe("Option C"),
            D: z.string().optional().describe("Option D"),
            E: z.string().optional().describe("Option E"),
            F: z.string().optional().describe("Option F"),
            G: z.string().optional().describe("Option G"),
            H: z.string().optional().describe("Option H"),
          })
          .strict()
          .describe("Options"),
        ans: z
          .array(z.enum(["A", "B", "C", "D", "E", "F"]))
          .min(1)
          .describe("Answers"),
        explanation: z.string().optional().describe("Explanation"),
        result: z.string().nonempty().describe("result"),
      })
    )
    .nonempty()
    .describe("Array of exam questions")
);
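In this step, we create a parser for structured output using the Zod schema library. The schema defines the structure and validation rules for the extracted exam questions: a positive integer question number, non-empty question text, an options object in which A and B are required and C through H are optional, one or more answer letters from A to F, an optional explanation, and a non-empty result string. The parser will be used to parse and validate the responses generated by the OpenAI language model.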
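To make the rules concrete, here is a dependency-free sketch that mirrors the main constraints of the schema in plain JavaScript. It is not how Zod works internally, and checkQuestion is a hypothetical helper. It returns the names of the fields that break a rule.

```javascript
// Plain-JavaScript mirror of the main schema rules (illustration only, not Zod).
const OPTION_KEYS = ["A", "B", "C", "D", "E", "F", "G", "H"];
const ANSWER_KEYS = ["A", "B", "C", "D", "E", "F"];

const isNonEmptyString = (v) => typeof v === "string" && v.length > 0;

function checkQuestion(q) {
  const errors = [];
  if (!Number.isInteger(q.no) || q.no <= 0) errors.push("no");
  if (!isNonEmptyString(q.question)) errors.push("question");

  // A and B are required; only keys A to H are allowed (the .strict() rule).
  const options = q.options;
  const optionsOk =
    typeof options === "object" &&
    options !== null &&
    isNonEmptyString(options.A) &&
    isNonEmptyString(options.B) &&
    Object.entries(options).every(
      ([key, value]) => OPTION_KEYS.includes(key) && typeof value === "string"
    );
  if (!optionsOk) errors.push("options");

  // At least one answer, each drawn from A to F.
  const ansOk =
    Array.isArray(q.ans) &&
    q.ans.length >= 1 &&
    q.ans.every((a) => ANSWER_KEYS.includes(a));
  if (!ansOk) errors.push("ans");

  if (q.explanation !== undefined && typeof q.explanation !== "string") {
    errors.push("explanation");
  }
  if (!isNonEmptyString(q.result)) errors.push("result");
  return errors;
}

const validQuestion = {
  no: 1,
  question: "What is the capital of France?",
  options: { A: "Paris", B: "Berlin" },
  ans: ["A"],
  explanation: "",
  result: "success",
};
const noOptionsQuestion = { ...validQuestion, options: {}, ans: [], result: "failed" };

console.log(checkQuestion(validQuestion)); // []
console.log(checkQuestion(noOptionsQuestion)); // [ 'options', 'ans' ]
```

The second example suggests that a reply with an empty options object and no answers would not pass the schema as written, so such a question would most likely be recorded through the error-handling path rather than as a parsed object.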
Step 8: Get Format Instructions for the Parser
const formatInstructions = parser.getFormatInstructions();
We retrieve the parser's format instructions: text, derived from the schema, that tells the model how to structure its output so that the parser can read it. These instructions will be used in the prompt template.
Step 9: Create a Prompt Template
const prompt = new PromptTemplate({
  template: "Extract information as it is from text.\n{format_instructions}\nExtract the 'options' text from before the 'explanation' text section. If options are not explicitly written in the prompt in the format 'A. option_a_text', 'B. option_b_text', 'C. option_c_text', etc., set the 'options' field as an empty object and provide the 'result' field as 'failed'. If options are present in correct format in the prompt, extract the correct 'ans' field with valid options (A, B, C, D, E, F) and provide the 'result' field as 'success'.\nHere is an example to understand what to do in case options are not explicitly available in the question description: \nExample question description: '1. Which cities are present in India? Explanation: Delhi because Berlin is in Germany.'\n Output : ['no' : 1, 'question': 'Which cities are present in India', 'options' : [], 'ans' : [], 'explanation' : 'Delhi because Berlin is in Germany', 'result' : 'failed']\n\nIf the explanation is empty in question description set the 'explanation' field as empty string in response. For example, if the question description is '1. What is the capital of France? A. Paris B. Berlin Answer: A Explanation:' \n Output : ['no' : 1, 'question': 'What is the capital of France?', 'options' : ['A' : 'Paris', 'B' : 'Berlin'], 'ans' : ['A'], 'explanation' : '', 'result' : 'success']\nPlease output the extracted information in JSON codeblock format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. All output must be in markdown JSON codeblock and follow the schema specified above.\nQuestion description: {inputText} ",
  inputVariables: ["inputText"],
  partialVariables: { format_instructions: formatInstructions },
});
We create a prompt template using the PromptTemplate class. The template string contains instructions and variables that will be filled in dynamically. In this case, the inputText variable will be replaced with the extracted questions, and the format_instructions variable will be replaced with the retrieved format instructions. Remember to add examples to the prompt so that the model understands the instructions better.
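The substitution itself can be pictured with a few lines of plain JavaScript. This is only a sketch of the idea, not LangChain's implementation: fillTemplate is a hypothetical helper, and the template is a shortened stand-in for the real one.

```javascript
// Hypothetical stand-in for template filling: replaces {name} placeholders
// with values and throws if a placeholder has no value. Not LangChain's code.
function fillTemplate(template, values) {
  return template.replace(/\{(\w+)\}/g, (placeholder, name) => {
    if (!(name in values)) {
      throw new Error(`Missing value for ${placeholder}`);
    }
    return String(values[name]);
  });
}

// Shortened stand-in for the real prompt template.
const template =
  "Extract information as it is from text.\n" +
  "{format_instructions}\n" +
  "Question description: {inputText}";

const filled = fillTemplate(template, {
  format_instructions: "<schema instructions go here>",
  inputText: "Question 1: What is 2 + 2? A. 3 B. 4 Answer: B",
});
console.log(filled);
```

Because single curly braces mark variables in this style of template, literal braces inside the prompt text would be read as placeholders. That may be why the sample outputs in the prompt are written with square brackets.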
Step 10: Create an OpenAI Model Instance
const model = new OpenAI({
  temperature: 0,
  model: "gpt-3.5-turbo-0613",
  maxTokens: 2000,
  reset: true,
});
We create an instance of the OpenAI language model using the OpenAI class. The instance is configured with the desired settings, such as temperature, model version, maximum tokens, and reset behavior.
Step 11: Generate Responses for Each Question
let extractedOutput = [];
let totalRounds = extractedQuestions.length;
let currentQuestionNumber = 0;
for (let i = 0; i < totalRounds; i++) {
  currentQuestionNumber = i + 1;
  const input = await prompt.format({
    inputText: extractedQuestions[i],
  });
  // Code for generating and parsing the response
}
In this step, we iterate through each extracted question and generate responses for them using the OpenAI language model. We format the input by replacing the inputText variable in the prompt template with the current question. The generated response will be parsed and processed in the subsequent steps.
Step 12: Generate, Parse, and Process the Response
for (let i = 0; i < totalRounds; i++) {
  currentQuestionNumber = i + 1;
  const input = await prompt.format({
    inputText: extractedQuestions[i],
  });
  let response;
  try {
    // Generate response from the model
    response = await model.call(input);
    try {
      // Parse the response using the output parser
      let output = await parser.parse(response);
      extractedOutput.push(output[0]);
    } catch (err) {
      let errObj = {
        no: currentQuestionNumber,
        result: "failed",
      };
      extractedOutput.push(errObj);
    }
  } catch (err) {
    let errObj = {
      no: currentQuestionNumber,
      result: "failed",
      remark: "failed to generate response",
    };
    extractedOutput.push(errObj);
  }
}
We generate a response from the OpenAI language model by calling the call() method on the model instance with the formatted prompt for the current question. We then parse the response using the structured output parser we created earlier. If parsing succeeds, the first parsed question is stored in the extractedOutput array. If parsing fails, an error object with the question number and a result of "failed" is added instead. If the model call itself fails, the error object also carries the remark "failed to generate response".
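The two failure paths can be exercised without an API key by swapping in stand-ins. In this sketch, fakeModelCall and parseJsonBlock are hypothetical replacements for the model and the output parser, and the nested try/catch is written as two sequential blocks, which records the same three outcomes.

```javascript
// Build a markdown code fence without writing three backticks literally.
const FENCE = "`".repeat(3);

// Hypothetical stand-in for the output parser: pull the JSON out of a
// markdown code block and parse it. Throws when no valid JSON is found.
function parseJsonBlock(text) {
  const pattern = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);
  const match = text.match(pattern);
  return JSON.parse(match ? match[1] : text);
}

// Hypothetical stand-in for the model: canned replies keyed by question index.
async function fakeModelCall(index) {
  if (index === 0) {
    return FENCE + 'json\n[{"no": 1, "result": "success"}]\n' + FENCE;
  }
  if (index === 1) return "Sorry, I cannot answer that."; // not JSON
  throw new Error("network error"); // simulated API failure
}

async function processAll(totalRounds) {
  const extractedOutput = [];
  for (let i = 0; i < totalRounds; i++) {
    const no = i + 1;
    let response;
    try {
      response = await fakeModelCall(i);
    } catch (err) {
      // The model call itself failed.
      extractedOutput.push({ no, result: "failed", remark: "failed to generate response" });
      continue;
    }
    try {
      extractedOutput.push(parseJsonBlock(response)[0]);
    } catch (err) {
      // A reply arrived but could not be parsed.
      extractedOutput.push({ no, result: "failed" });
    }
  }
  return extractedOutput;
}

processAll(3).then((output) => console.log(output));
```

Running it gives one parsed entry, one entry marked failed for an unparseable reply, and one entry marked failed with the remark for the simulated API error.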
Step 13: Save Extracted Output to a File
const outputFilePath = "extracted_output.json";
fs.writeFile(outputFilePath, JSON.stringify(extractedOutput), (err) => {
  if (err) console.log(err);
});
Finally, we save the extracted output to a JSON file. The extractedOutput array is serialized with JSON.stringify() and written to the file using the writeFile() method from the fs module. Unlike appendFile(), writeFile() replaces any existing file, so repeated runs still leave valid JSON on disk.
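A note on choosing between overwriting and appending: if the script runs twice against the same output file, appending a second serialized array leaves the file unparseable. A small sketch, with in-memory strings standing in for the file contents:

```javascript
// Two runs of the script, each producing one serialized array.
const firstRun = JSON.stringify([{ no: 1, result: "success" }]);
const secondRun = JSON.stringify([{ no: 1, result: "success" }]);

// What the file would contain if the second run were appended to the first.
const appended = firstRun + secondRun;

function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidJson(firstRun)); // true
console.log(isValidJson(appended)); // false
```

Passing a third argument, as in JSON.stringify(data, null, 2), also makes the saved file easier to read by indenting it.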
Conclusion
In conclusion, the provided code demonstrates an approach to extracting questions from a PDF file and generating structured output in JSON format. By leveraging various technologies such as the OpenAI language model, Langchain, and the Zod library for schema validation, the code achieves the desired goal of extracting questions and relevant information from the PDF.
By following the step-by-step breakdown of the code, addressing the challenges faced, and incorporating the provided tips, developers can gain insights into how to approach similar tasks involving information extraction from PDF files. With further customization and enhancements, this code can serve as a foundation for building more advanced question extraction systems or integrating it into larger applications.
Overall, the code exemplifies the power of combining natural language processing models, pattern matching techniques, and structured output parsing to automate the extraction of relevant information from unstructured documents, opening up possibilities for various applications in fields such as education, data analysis, and information retrieval.