Dedoc
This sample demonstrates the use of Dedoc in combination with LangChain as a DocumentLoader.
Overview
Dedoc is an open-source library/service that extracts texts, tables, attached files and document structure (e.g., titles, list items, etc.) from files of various formats.
Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images and more.
The full list of supported formats can be found here.
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
DedocFileLoader | langchain_community | ❌ | beta | ❌ |
DedocPDFLoader | langchain_community | ❌ | beta | ❌ |
DedocAPIFileLoader | langchain_community | ❌ | beta | ❌ |
Loader features
Methods for lazy loading and async loading are available, but the underlying document loading is executed synchronously.
Source | Document Lazy Loading | Async Support |
---|---|---|
DedocFileLoader | ❌ | ❌ |
DedocPDFLoader | ❌ | ❌ |
DedocAPIFileLoader | ❌ | ❌ |
Setup
- To access the DedocFileLoader and DedocPDFLoader document loaders, you'll need to install the dedoc integration package.
- To access DedocAPIFileLoader, you'll need to run the Dedoc service, e.g. as a Docker container (please see the documentation for more details):
docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc
Dedoc installation instructions are given here.
# Install package
%pip install --quiet "dedoc[torch]"
Note: you may need to restart the kernel to use updated packages.
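If you plan to use DedocAPIFileLoader, it can help to confirm that something is actually listening on the service port before loading a file. The following is a minimal sketch using only the standard library. The helper name `dedoc_service_is_up` is hypothetical, and the default host and port simply assume the `docker run -p 1231:1231` mapping from the Setup step. It only checks that a TCP connection succeeds, not that the process behind the port is Dedoc.

```python
import socket


def dedoc_service_is_up(host: str = "localhost", port: int = 1231, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: the defaults assume the port mapping used in Setup.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Prints True when the port accepts connections, False otherwise
print(dedoc_service_is_up())
```

A False result usually means the container is not running or the port mapping differs from the one assumed here.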
Instantiation
from langchain_community.document_loaders import DedocFileLoader
loader = DedocFileLoader("./example_data/state_of_the_union.txt")
Load
docs = loader.load()
docs[0].page_content[:100]
'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t'
Lazy Load
docs = loader.lazy_load()
for doc in docs:
    print(doc.page_content[:100])
    break
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t
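The loop above stops after the first document with `break`. One alternative for taking the first few items from any iterator is `itertools.islice`, sketched below. The `take` helper and the `fake_lazy_load` generator are hypothetical. The generator stands in for `loader.lazy_load()` and yields plain strings instead of Document objects, so the sketch runs without Dedoc installed.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def take(items: Iterable[T], n: int) -> List[T]:
    """Return at most the first n items, pulling no more from the iterator than needed."""
    return list(islice(items, n))


def fake_lazy_load() -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load(); yields strings, not Documents."""
    for i in range(1000):
        yield f"document {i}"


first_two = take(fake_lazy_load(), 2)
print(first_two)  # ['document 0', 'document 1']
```

With a real loader, the same call would be `take(loader.lazy_load(), 2)`, assuming `lazy_load()` returns an iterator of documents as in the example above.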
API reference
For detailed information on configuring and calling Dedoc loaders, please see the API references:
- https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.dedoc.DedocFileLoader.html
- https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.DedocPDFLoader.html
- https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.dedoc.DedocAPIFileLoader.html