Automated Bank Statement Analysis Using GPT, Python and Langchain
Introduction
In the rapidly evolving landscape of financial management, the ability to swiftly extract and analyze transaction data from bank statements is indispensable. This project showcases a solution that combines Python, GPT models, and LangChain to automate the extraction of transaction information from bank statement PDFs.
By harnessing the power of image processing, Optical Character Recognition (OCR), and the GPT-4 model, this project streamlines the traditionally laborious task of poring over bank statements. The resulting automation not only saves valuable time but also enhances the accuracy of transaction data extraction, contributing to more efficient financial record management. In this article, we will delve into the step-by-step breakdown of the code, highlighting how each stage contributes to the creation of this automated bank statement analysis tool.
Aim of the Project
At its core, this project aims to develop a sophisticated system for automating the extraction, analysis, and organization of transaction details from bank statement PDFs. The technical objectives of the project are as follows:
1. GPT-Powered Query Generation and Retrieval: Using OpenAI GPT models through LangChain's ChatOpenAI class, the project generates contextually relevant queries that retrieve transaction information from the extracted text data. GPT's natural language capabilities are used to craft meaningful and contextually accurate questions.
2. Image Preprocessing for OCR Optimization: The project aims to enhance OCR accuracy by implementing image preprocessing techniques. These techniques include adaptive thresholding and Gaussian blur to reduce noise and improve the quality of textual content within the images.
3. Accurate Text Extraction via OCR: The system seeks to employ Optical Character Recognition (OCR) to accurately transcribe textual information from preprocessed bank statement images. This involves configuring OCR parameters, such as page segmentation mode and whitelist characters, to ensure precise extraction.
4. Structured Data Organization: The extracted transaction details are intended to be organized in a structured format, ensuring coherence and consistency across the data. Each transaction entry is formatted as JSON objects with distinct keys for date, title, amount, balance, and validity.
5. Data Filtering and Validation: The system is designed to intelligently filter out irrelevant or erroneous data entries, ensuring that only valid and meaningful transactions are considered for analysis. This involves implementing data validation checks and utilizing techniques to identify and discard irrelevant information.
6. Flexible Adaptability: A key technical goal of the project is to make the system adaptable to varying bank statement layouts and content structures. This adaptability involves implementing algorithms and methods to handle diverse data patterns commonly encountered in different bank statements.
7. Efficient JSON and CSV Conversion: The project aims to facilitate seamless data export by converting the structured transaction data into JSON and CSV formats. This conversion ensures that the processed data is readily accessible for further analysis and reporting.
8. Optimized Performance: The project's technical scope involves optimizing the performance of the overall system, minimizing processing time, memory usage, and computational resources required for data extraction, analysis, and transformation.
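To make objectives 4 and 5 concrete, here is a minimal sketch of what one transaction record and a validation check could look like. The field name valid, the date format, and the sample rows are assumptions made for illustration; the original script may name and check these differently.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Transaction:
    # Keys follow the objectives above; "valid" is an assumed name for the validity flag.
    date: str
    title: str
    amount: str
    balance: str
    valid: bool = True


def parse_amount(text: str) -> Optional[float]:
    """Return the number in a string like '1,245.50', or None if it is not numeric."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def is_valid(tx: Transaction, date_format: str = "%d/%m/%Y") -> bool:
    """Keep a row only if it has a parseable date, a title, and a numeric amount."""
    try:
        datetime.strptime(tx.date.strip(), date_format)
    except ValueError:
        return False
    return bool(tx.title.strip()) and parse_amount(tx.amount) is not None


# Made-up sample rows: one real-looking transaction and one piece of OCR noise.
rows = [
    Transaction("01/02/2023", "COFFEE SHOP", "-4.50", "1,245.50"),
    Transaction("Page 1 of 3", "", "", ""),
]
kept = [tx for tx in rows if tx.valid and is_valid(tx)]
```

Keeping amounts as strings until validation avoids silently turning OCR noise into numbers; the date format would need adjusting for each bank's layout.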
Step-by-Step Guide
We will use several libraries: fitz (PyMuPDF), PIL, numpy, cv2, pytesseract, pandas, and modules from the langchain package. These libraries help us extract images from the PDF, preprocess them, extract text using OCR, analyze the text, and organize the results. Let's break the script down into steps for a better understanding.
Step 1: Importing Libraries and Setting Up Paths
In this step, the necessary libraries are imported, covering image processing (PIL, numpy, cv2), text extraction (pytesseract), data handling (pandas), and the langchain modules. The paths for the PDF, image folder, and text folder are defined. You will also need to create a .env file and paste your OpenAI API key into it in the following format:
OPENAI_API_KEY=YOUR_API_KEY
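A rough sketch of this setup step follows. The file and folder names (statement.pdf, images, text) and the helper names are placeholders, not the original script's. The small .env reader uses only the standard library to keep the example self-contained; the python-dotenv package's load_dotenv() is the more common way to do the same thing.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE lines from a .env file into os.environ (existing values win)."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        os.environ.setdefault(key.strip(), value.strip().strip("'\"`"))


# Hypothetical locations; adjust to your own project layout.
PDF_PATH = Path("statement.pdf")
IMAGE_DIR = Path("images")
TEXT_DIR = Path("text")


def prepare_workspace():
    """Load the API key and make sure the output folders exist."""
    load_env_file()
    for folder in (IMAGE_DIR, TEXT_DIR):
        folder.mkdir(parents=True, exist_ok=True)
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is not set; add it to the .env file")
```

Failing early when the key is missing saves a full OCR run before the first GPT call reports the problem.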
Step 2: Extracting Images and Preprocessing
This step involves opening the PDF and iterating through its pages. For each page, the page's get_pixmap method renders a pixmap representation of the page, which is then converted into a NumPy array holding the page's image content. To improve the legibility of the text within the images, several preprocessing techniques are applied: adaptive thresholding to convert the image into a binary format, Gaussian blur to reduce noise, and contrast enhancement using the ImageEnhance module from the PIL library.
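The step above might be sketched as follows. The order of operations (contrast, then blur, then threshold), the kernel size, block size, contrast factor, and DPI are all assumed values that usually need tuning per statement; the function names and the page_N.png naming are placeholders as well.

```python
import numpy as np


def pixmap_to_array(pix):
    """Turn a PyMuPDF pixmap into an H x W x C uint8 array, dropping any alpha channel."""
    arr = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
    return arr[:, :, :3] if pix.n == 4 else arr


def preprocess(rgb, contrast=1.5):
    """Contrast boost, Gaussian blur, then adaptive threshold. Expects an H x W x 3 array."""
    import cv2
    from PIL import Image, ImageEnhance

    boosted = ImageEnhance.Contrast(Image.fromarray(rgb)).enhance(contrast)
    gray = cv2.cvtColor(np.array(boosted), cv2.COLOR_RGB2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)  # smooth out scanner noise
    # Binarise using a threshold computed per 11 x 11 neighbourhood.
    return cv2.adaptiveThreshold(
        blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
    )


def pdf_to_images(pdf_path, image_dir, dpi=300):
    """Render every page, preprocess it, and save it as a PNG in image_dir."""
    import cv2
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            rgb = pixmap_to_array(page.get_pixmap(dpi=dpi))
            cv2.imwrite(f"{image_dir}/page_{number}.png", preprocess(rgb))
```

Rendering at a higher DPI generally helps OCR on small statement fonts, at the cost of larger images and slower processing.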
Step 3: Extracting Text Using OCR
In this step, each preprocessed image is loaded, and OCR is applied using pytesseract. The extracted data dictionary is converted into a pandas DataFrame for easier manipulation and analysis. Rows with a confidence (conf) value of -1, or with empty or single-space text, are filtered out, resulting in a filtered DataFrame df1.
The filtered DataFrame df1 is grouped by the block number (block_num) and sorted by the vertical position (top) within each block. The loop iterates through each grouped block and reconstructs the text within each block. It takes care of line breaks and formatting. The organized text is then saved into separate text files for each page.
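The filtering and regrouping described above could look roughly like this. The function names and the page segmentation mode (--psm 6) are assumptions, and using par_num and line_num to detect line breaks is one reasonable choice rather than necessarily what the original script does.

```python
import pandas as pd


def filter_ocr_rows(data):
    """Build df1: drop rows with conf == -1 and rows whose text is empty or blank."""
    df = pd.DataFrame(data)
    df["conf"] = pd.to_numeric(df["conf"], errors="coerce")  # some versions return strings
    has_text = df["text"].astype(str).str.strip() != ""
    return df[(df["conf"] != -1) & has_text].copy()


def rebuild_text(df1):
    """Rebuild text block by block, ordering lines by 'top' and words by 'left'."""
    blocks = []
    for _, block in df1.groupby("block_num", sort=True):
        lines = []
        for _, line in block.groupby(["par_num", "line_num"], sort=False):
            words = line.sort_values("left")["text"].astype(str)
            lines.append((line["top"].min(), " ".join(words)))
        lines.sort(key=lambda item: item[0])  # top-to-bottom within the block
        blocks.append("\n".join(text for _, text in lines))
    return "\n\n".join(blocks)


def ocr_page(image_path, config="--psm 6"):
    """Run Tesseract on one preprocessed image and return the reconstructed text."""
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), config=config, output_type=pytesseract.Output.DICT
    )
    return rebuild_text(filter_ocr_rows(data))
```

The text returned by ocr_page can then be written to one file per page in the text folder.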
Step 4: Utilizing GPT for Transaction Retrieval
In this step, we use LangChain to run extractive question answering (QA) over the bank statement text, following the pattern described at https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa
Here is the basic flow for QA over documents using LangChain:
Data Loading: Begin by importing unstructured data from various sources using LangChain's integration hub. Different loaders facilitate this process, transforming data into LangChain Documents. We have used TextLoader here.
Segmentation: Employ text splitters to break down Documents into smaller sections of specified sizes. This segmentation aids in effective data handling. CharacterTextSplitter does this for us.
Storage: Use a storage solution, often a vector store, to embed and house these segmented sections so they can be searched by meaning. We have used ChromaDB for this purpose. An embedding is a vector (list) of floating-point numbers; the distance between two embeddings measures how related the underlying text strings are.
Retrieval: Fetch the stored sections whose embeddings are most similar to the embedding of the input question. LangChain provides an abstraction over this logic through QA chains. Here the RetrievalQA chain, a chain for question answering against an index, does this work for us by pulling in the relevant context. You can provide a custom prompt template to the chain and specify the structure/format in which you want the data extracted.
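The four stages above can be sketched as follows. The import paths match the LangChain 0.0.x releases that were current when the linked guide was written; newer releases have moved several of these classes, so treat the paths as version-dependent. The prompt wording, chunk size, model name, key names, and the build_qa_chain helper are all assumptions for illustration.

```python
# Literal braces in the JSON example are doubled so the template can still be formatted.
PROMPT_TEMPLATE = """Use the bank statement text below to answer the question.
Return only a JSON array. Each element must look like:
{{"date": "...", "title": "...", "amount": "...", "balance": "...", "valid": true}}

Statement text:
{context}

Question: {question}
"""


def build_qa_chain(text_path, model_name="gpt-4"):
    """Load one page of OCR text, index it in Chroma, and return a RetrievalQA chain."""
    from langchain.chains import RetrievalQA
    from langchain.chat_models import ChatOpenAI
    from langchain.document_loaders import TextLoader
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.prompts import PromptTemplate
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.vectorstores import Chroma

    documents = TextLoader(text_path).load()                      # data loading
    splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    chunks = splitter.split_documents(documents)                  # segmentation
    store = Chroma.from_documents(chunks, OpenAIEmbeddings())     # storage
    prompt = PromptTemplate(
        template=PROMPT_TEMPLATE, input_variables=["context", "question"]
    )
    return RetrievalQA.from_chain_type(                           # retrieval + answer
        llm=ChatOpenAI(model_name=model_name, temperature=0),
        chain_type="stuff",
        retriever=store.as_retriever(),
        chain_type_kwargs={"prompt": prompt},
    )


# Example call (needs OPENAI_API_KEY and network access):
# answer = build_qa_chain("text/page_1.txt").run("List every transaction on this page.")
```

A temperature of 0 is chosen here because the goal is faithful extraction rather than varied wording.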
Step 5: Structuring the Result
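As a rough illustration of this step, the sketch below pulls the JSON array out of the model's reply, keeps only entries marked as valid, and writes the result to JSON and CSV files, in line with the structuring, filtering, and export objectives listed earlier. The key name valid, the function names, and the assumption that the reply contains exactly one JSON array are illustrative, not taken from the original script.

```python
import csv
import json
import re

FIELDS = ["date", "title", "amount", "balance", "valid"]


def parse_transactions(answer):
    """Pull the JSON array out of a model reply and keep the rows marked valid."""
    match = re.search(r"\[.*\]", answer, flags=re.DOTALL)
    if match is None:
        return []
    rows = json.loads(match.group(0))  # raises ValueError if the reply is malformed
    return [
        {key: row.get(key, "") for key in FIELDS}
        for row in rows
        if isinstance(row, dict) and row.get("valid", True)
    ]


def export(rows, json_path, csv_path):
    """Write the same rows to a JSON file and a CSV file."""
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle, indent=2)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Because language models sometimes wrap JSON in extra commentary, extracting the array with a regular expression before parsing is more forgiving than calling json.loads on the whole reply.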
By breaking down the code into these steps, we've outlined the progression of tasks involved in extracting transaction details from a bank statement PDF and organizing them into a structured format for further analysis.
Conclusion
In summary, the integration of cutting-edge techniques showcased here has the potential to revolutionize the way we handle data extracted from bank statement images. By combining image preprocessing, advanced language models, and efficient data retrieval, we unlock efficiency gains, improved accuracy, and valuable insights previously hidden in unstructured data.
Beyond bank statements, these methods can be applied across industries, offering smarter data handling and enabling intelligent conversations with machines. As technology evolves, so does the scope for enhanced data extraction, analysis, and utilization.