Extracting Information from Files and PDFs

Sandeep

(@sandeep)

Posts: 2

New Member

Topic starter

Hello Team,

I am exploring callin.io and have a PDF file that contains regular text alongside an image that itself contains text.

When I use the “Extract from File” PDF operation, the output JSON contains the regular text from the PDF. However, the image (and the text inside it) is skipped, which is expected, since the PDF operation is designed to read only text.

My PDF has some pages with both an image (containing text) and regular text, and other pages with only regular text. Is it possible for the node to indicate which pages contain an image? That would let me send only those pages to an OCR API to extract the text from the image.

Could you please suggest a workflow on how I can achieve this requirement of handling Images (with text) + regular text using a callin.io workflow? Thank you for your assistance.

Posted : 17/10/2024 3:27 pm

n8n

(@n8n)

Posts: 75

Trusted Member

It appears your topic is missing some crucial details. Could you please provide the following information, if relevant?

callin.io version:
Database (default: SQLite):
callin.io EXECUTIONS_PROCESS setting (default: own, main):
Running callin.io via (Docker, npm, callin.io cloud, desktop app):
Operating system:

Please share these details to help us understand your issue better.

Posted : 17/10/2024 3:27 pm

Simon_Coton

(@simon_coton)

Posts: 1

New Member

Hi Sandeep,

I'd suggest a workflow similar to what you've already mentioned:

1. Extract text directly from the PDF.
2. Send the same PDF to an OCR API to retrieve text embedded within images.
3. Consolidate the extracted text, perhaps by PDF name, to group all related information.
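The consolidation step could look roughly like the sketch below. It is only an illustration: the record keys (`pdf_name`, `source`, `text`) and the function name are hypothetical, and the records are assumed to have been collected from the direct extraction and the OCR call beforehand.

```python
from collections import defaultdict


def consolidate_by_pdf(records):
    """Group extracted text fragments by PDF name.

    Each record is a dict with hypothetical keys:
      pdf_name: file name of the source PDF
      source:   "text" (direct extraction) or "ocr"
      text:     the extracted string
    """
    grouped = defaultdict(lambda: {"text": [], "ocr": []})
    for record in records:
        grouped[record["pdf_name"]][record["source"]].append(record["text"])
    # Join the fragments so each PDF ends up with one string per source.
    return {
        name: {source: "\n".join(parts) for source, parts in sources.items()}
        for name, sources in grouped.items()
    }
```

Keeping the two sources separate per PDF makes it easier to see later which text came from OCR, which tends to be less reliable than directly extracted text.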

Regarding how to detect the presence of an image, what does the "Extract from File" node output when it encounters an image? Does it indicate that an image is present, or does it return nothing? If it returns nothing, you could potentially add a step that invokes an AI node. You would upload the PDF to this AI node, and it could then provide information about which pages contain images.
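As a cheaper alternative to the AI node (not something proposed in this thread), a rough heuristic is to scan the raw PDF bytes for the `/Subtype /Image` marker that image XObjects carry in their dictionaries. This is a minimal sketch under several assumptions: it works on the whole document rather than per page, it misses inline images embedded in content streams, and it cannot tell whether an image actually contains text. The function name is hypothetical.

```python
import re

# Image XObjects declare "/Subtype /Image" in their stream dictionary.
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def pdf_has_image(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes declare at least one image XObject."""
    return IMAGE_MARKER.search(pdf_bytes) is not None
```

A result of `True` would only justify sending the file to OCR; finding the exact pages would still need a PDF library or the AI step described above.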

Posted : 17/10/2024 3:56 pm

Sandeep

(@sandeep)

Posts: 2

New Member

Topic starter

Hello Simon,

Thank you for the prompt reply. To answer your question, the “Extract from File” node doesn't output anything for the image. If a PDF page contains an image with text alongside regular text, it returns only the regular text. Following your suggestion:

Step 1 → Upload PDF to an AI Node that identifies pages with images.
Step 2 → Send those specific PDF pages to an OCR API to extract text from images.
Step 3 → Send the remaining PDF pages to the existing “Extract from File” PDF Operation to retrieve text from those pages.
Step 4 → Combine the results from Step 2 and Step 3.
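The routing in Steps 2 to 4 can be sketched as follows. This assumes Step 1 has already produced one record per page with a page number and an image flag; the keys `number` and `has_image` and both function names are hypothetical, and the OCR and extraction calls themselves are left out.

```python
def route_pages(pages):
    """Split page records into OCR and direct-extraction groups.

    Each page is a dict with hypothetical keys "number" and "has_image".
    """
    ocr_pages = [p["number"] for p in pages if p["has_image"]]
    text_pages = [p["number"] for p in pages if not p["has_image"]]
    return ocr_pages, text_pages


def combine_results(ocr_text, extracted_text):
    """Merge two {page_number: text} dicts and return text in page order."""
    merged = {**extracted_text, **ocr_text}
    return [merged[number] for number in sorted(merged)]
```

Note that a page routed to OCR is expected to yield both its regular text and its image text from the OCR result, so it does not also need to go through the direct extraction.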

Would it be feasible to split the PDF by page(s)? This would allow us to send the PDF pages containing images to the OCR API.
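The thread does not say which tool would perform the split. Whichever one is used, it will most likely be driven by page ranges, so one small preparatory step is to collapse the flagged page numbers into contiguous ranges. A minimal sketch, with a hypothetical function name:

```python
def contiguous_ranges(page_numbers):
    """Collapse page numbers into inclusive (start, end) ranges."""
    ranges = []
    for number in sorted(set(page_numbers)):
        if ranges and number == ranges[-1][1] + 1:
            # Extend the current range by one page.
            ranges[-1] = (ranges[-1][0], number)
        else:
            ranges.append((number, number))
    return ranges
```

Each resulting range could then be cut out as its own small PDF and sent to the OCR API in a single request, instead of one request per page.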

Thank you

Posted : 17/10/2024 5:04 pm

system

(@system)

Posts: 241

Estimable Member

This thread was automatically closed 90 days following the last response. New replies are no longer permitted.

Posted : 15/01/2025 5:04 pm
