feat: heya
Browse files- README.md +3 -4
- app/app.py +66 -1
README.md
CHANGED
|
@@ -5,9 +5,8 @@ _See the readme file in the main branch for updated instructions and information
|
|
| 5 |
## Lab3: Enabling Load PDF to Chainlit App
|
| 6 |
Building on top of the current simplified version of ChatGPT using Chainlit, we are now going to add PDF loading capabilities to the application.
|
| 7 |
|
| 8 |
-
|
| 9 |
|
| 10 |
-
In this lab, we will be adding a Chat LLM to our Chainlit app using Langchain.
|
| 11 |
|
| 12 |
## Exercises
|
| 13 |
|
|
@@ -34,6 +33,6 @@ chainlit run app/app.py -w
|
|
| 34 |
|
| 35 |
## References
|
| 36 |
|
| 37 |
-
- [Langchain
|
| 38 |
-
- [Langchain
|
| 39 |
- [Chainlit's documentation](https://docs.chainlit.io/get-started/pure-python)
|
|
|
|
| 5 |
## Lab3: Enabling Load PDF to Chainlit App
|
| 6 |
Building on top of the current simplified version of ChatGPT using Chainlit, we are now going to add PDF loading capabilities to the application.
|
| 7 |
|
| 8 |
+
In this lab, we will utilize the built-in PDF loading and parsing connectors inside Langchain, load the PDF, and chunk the PDFs into individual pieces with their associated metadata.
|
| 9 |
|
|
|
|
| 10 |
|
| 11 |
## Exercises
|
| 12 |
|
|
|
|
| 33 |
|
| 34 |
## References
|
| 35 |
|
| 36 |
+
- [Langchain PDF Loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf)
|
| 37 |
+
- [Langchain Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/#text-splitters)
|
| 38 |
- [Chainlit's documentation](https://docs.chainlit.io/get-started/pure-python)
|
app/app.py
CHANGED
|
@@ -1,9 +1,74 @@
|
|
|
|
|
|
|
|
|
|
|
| 1 |
import chainlit as cl
|
|
|
|
| 2 |
from langchain.chat_models import ChatOpenAI
|
| 3 |
from langchain.prompts import ChatPromptTemplate
|
| 4 |
-
from langchain.schema import StrOutputParser
|
| 5 |
from langchain.chains import LLMChain
|
| 6 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 7 |
|
| 8 |
@cl.on_chat_start
|
| 9 |
async def on_chat_start():
|
|
|
|
| 1 |
+
from tempfile import NamedTemporaryFile
|
| 2 |
+
from typing import List
|
| 3 |
+
|
| 4 |
import chainlit as cl
|
| 5 |
+
from chainlit.types import AskFileResponse
|
| 6 |
from langchain.chat_models import ChatOpenAI
|
| 7 |
from langchain.prompts import ChatPromptTemplate
|
| 8 |
+
from langchain.schema import Document, StrOutputParser
|
| 9 |
from langchain.chains import LLMChain
|
| 10 |
|
| 11 |
+
from langchain.document_loaders import PDFPlumberLoader
|
| 12 |
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
def process_file(*, file: AskFileResponse) -> List[Document]:
    """Processes one PDF file from a Chainlit AskFileResponse object by first
    loading the PDF document and then chunking it into sub-documents. Only
    supports PDF files.

    Args:
        file (AskFileResponse): input file to be processed

    Raises:
        TypeError: when the uploaded file is not a PDF.
        ValueError: when we fail to process PDF files. We consider PDF file
            processing failure when there's no text returned. For example, PDFs
            with only image contents, corrupted PDFs, etc.

    Returns:
        List[Document]: List of Document(s). Each individual document has two
            fields: page_content(string) and metadata(dict).
    """
    # We only support PDF as input.
    if file.type != "application/pdf":
        raise TypeError("Only PDF files are supported")

    with NamedTemporaryFile() as tempfile:
        tempfile.write(file.content)
        # Flush buffered bytes to disk before the loader re-opens the file
        # by name; without this the loader may read an empty/partial file.
        tempfile.flush()

        ######################################################################
        # Exercise 1a:
        # We have the input PDF file saved as a temporary file. The name of
        # the file is 'tempfile.name'. Please use one of the PDF loaders in
        # Langchain to load the file.
        ######################################################################
        loader = PDFPlumberLoader(tempfile.name)
        documents = loader.load()
        ######################################################################

        ######################################################################
        # Exercise 1b:
        # We can now chunk the documents now it is loaded. Langchain provides
        # a list of helpful text splitters. Please use one of the splitters
        # to chunk the file.
        ######################################################################
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=3000,
            chunk_overlap=100,
        )
        docs = text_splitter.split_documents(documents)
        ######################################################################

        # Fail early when parsing produced no chunks (image-only or corrupted
        # PDFs) before doing any per-chunk metadata work.
        if not docs:
            raise ValueError("PDF file parsing failed.")

        # We are adding source_id into the metadata here to denote which
        # source document it is.
        for i, doc in enumerate(docs):
            doc.metadata["source"] = f"source_{i}"

        return docs
|
| 71 |
+
|
| 72 |
|
| 73 |
@cl.on_chat_start
|
| 74 |
async def on_chat_start():
|