Overview of Different Web Scraping Techniques in callin.io

samliew
(@samliew)
Posts: 293
Reputable Member
Topic starter
 

As this is a frequently asked question, I've put together this post to explore the various methods for web scraping using callin.io. Each approach offers different levels of complexity and control.

Traditional Web Scraping + Text Parser

If you'd rather not rely on external services that may incur costs, you can fetch the page with the HTTP "Make a request" module, then use a Text Parser "Match Pattern" module to find and extract the content you want from the page's source code.

To achieve this effectively, a solid understanding of regular expression patterns is necessary. These patterns can become quite intricate, especially when aiming to match multiple content elements on a page with a single Match Pattern module. Alternatively, you could use a separate Match Pattern module for each piece of content you wish to extract, though this approach consumes more operations.
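Outside callin.io, the same fetch-then-match idea can be sketched in Python. This is only an illustration under assumptions: the sample HTML, the pattern, and the function name `extract_product` are hypothetical, and the fetch step is replaced by a hard-coded string so the example runs offline. It shows one pattern with named groups capturing two values at once, which corresponds to matching several content elements with a single Match Pattern module.

```python
import re
from typing import Optional

# Hypothetical page source; in a scenario this would come from the HTTP module.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">Blue Widget</h1>
  <span class="price">$19.99</span>
</body></html>
"""

# One pattern with two named groups. re.DOTALL lets "." span line breaks,
# so the lazy ".*?" can bridge the markup between the two elements.
PRODUCT_PATTERN = re.compile(
    r'<h1 class="title">(?P<title>.*?)</h1>'
    r'.*?'
    r'<span class="price">\$(?P<price>[\d.]+)</span>',
    re.DOTALL,
)


def extract_product(html: str) -> Optional[dict]:
    """Return the title and price found in html, or None if nothing matches."""
    match = PRODUCT_PATTERN.search(html)
    if match is None:
        return None
    return {
        "title": match.group("title").strip(),
        "price": float(match.group("price")),
    }


print(extract_product(SAMPLE_HTML))
```

The trade-off described above is visible here: one combined pattern is compact, but it breaks if the two elements ever appear in a different order, whereas two separate patterns would be simpler to maintain.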

Alternatives to consider:

  • XML "Perform XPath Query" —
    You can extract items using XPath, but it requires a separate module for each extraction.
  • Set Multiple Variables —
    You can use the replace function with a regular expression that matches everything except the content you want, stripping the unwanted parts and leaving only the desired "match".
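Both alternatives can be sketched in Python for comparison. The markup below is a hypothetical, well-formed sample (real pages often need cleanup before an XML parser accepts them), and Python's `xml.etree.ElementTree` supports only a subset of XPath, so this is an approximation of the idea rather than of the callin.io modules themselves.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical well-formed markup used for both approaches.
SAMPLE_XML = """
<div>
  <ul id="links">
    <li><a href="/first">First</a></li>
    <li><a href="/second">Second</a></li>
  </ul>
</div>
""".strip()

root = ET.fromstring(SAMPLE_XML)

# XPath-style query: every <a> inside the list whose id is "links".
hrefs = [a.get("href") for a in root.findall(".//ul[@id='links']/li/a")]

# "Negative" approach: remove what is not wanted and keep what is left.
# Here every tag is replaced by a space, then runs of whitespace are collapsed.
text_only = re.sub(r"<[^>]+>", " ", SAMPLE_XML)
text_only = re.sub(r"\s+", " ", text_only).strip()

print(hrefs)
print(text_only)
```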

Need help with complex web scraping requirements, building a pattern for your Text Parser, AI prompt engineering, or have some other callin.io-related question?
—> Let’s Talk

Hosted Web Scraping

If you wish to avoid managing web scraping directly, you can utilize services such as ScrapingBee and ScrapeNinja to retrieve content from web pages.

ScrapeNinja lets you target page elements with jQuery-like selectors inside its extractor function. This avoids regular expressions, although you can still use regex in the extractor function if needed.
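To show what selecting by tag and class means in practice, here is a minimal Python sketch. It does not use ScrapeNinja or its extractor API; the class `ClassTextCollector`, the helper `select_text`, and the sample HTML are all hypothetical, and only the standard library is used. It behaves roughly like the selector `li.item`.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of each element matching a tag and class (like `tag.class`)."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so the end tag pairs up.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1
                self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def select_text(html, tag, css_class):
    """Return the text content of every `tag.css_class` element in html."""
    collector = ClassTextCollector(tag, css_class)
    collector.feed(html)
    return collector.results


SAMPLE = (
    '<ul><li class="item first">Alpha</li>'
    '<li class="item">Beta <b>bold</b></li>'
    '<li class="other">Gamma</li></ul>'
)
print(select_text(SAMPLE, "li", "item"))
```

Compared with a regular expression, the selection is expressed in terms of the document structure, so attribute order and extra classes do not affect the match.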

The main benefit of hosted web scraping services like ScrapeNinja is that they can handle and bypass anti-scraping mechanisms. They run the page in a real web browser, loading all content and running the page-load scripts, which simulates a real visit far more closely than fetching raw HTML with the HTTP module. This is the core specialty of dedicated scraping services.

For an example of ScrapeNinja usage, please refer to Grab data from page and url - #5 by samliew


Either of the Above + AI Structured Data Extraction

You can use either traditional HTTP scraping or hosted web scraping to retrieve the source code of the target page, then pass that source to an AI to transform it into structured data (output as variables/collections, or as JSON that requires a Parse JSON module).

This approach offers flexibility in extracting content into complex data structures (collections), but it does involve prompt engineering and the setup of the data structure, either through fields (OpenAI) or by embedding JSON within the prompt itself (Groq).
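When the AI returns JSON as text, the parsing step amounts to something like the Python sketch below. It is an illustration only: the function name `parse_ai_json` and the sample replies are hypothetical, and the handling of a surrounding Markdown code fence is an assumption about how some models format their replies.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

# Matches an optional fenced block such as a fence, "json", the payload, a fence.
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def parse_ai_json(reply: str) -> dict:
    """Parse a JSON object from a model reply, tolerating a surrounding code fence."""
    fenced = FENCED_BLOCK.search(reply)
    payload = fenced.group(1) if fenced else reply
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


# Hypothetical replies: one wrapped in a fence, one plain.
fenced_reply = FENCE + 'json\n{"title": "Blue Widget", "tags": ["a", "b"]}\n' + FENCE
plain_reply = '{"price": 19.99}'

print(parse_ai_json(fenced_reply))
print(parse_ai_json(plain_reply))
```

Validating that the top level is an object mirrors the need to define the data structure up front: downstream steps can then rely on named fields being present in a collection.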


AI-powered Web Scraping

This is likely the most straightforward and rapid method to implement, as it only requires you to describe the content you need, rather than inspecting elements to create selectors or devising regular expression patterns.

The advantage here is that such services integrate both fetching and data extraction into a single module (saving operations) and eliminate the lengthy setup required by other methods.

For example, with the Dumpling AI "Extract data from URL" module, setup takes seconds: map the URL variable in the module and specify the fields you want to extract from the page. You don't even need to define the data types.

Furthermore, if you don't require structured data and simply want to pass the page content to another AI for further analysis, you can use the "Scrape URL" module. This module also removes extraneous elements like headers and footers, leaving only the main article content. This is particularly beneficial for training LLMs (e.g., OpenAI, HuggingFace, etc.).

To learn more about Dumpling AI, consult the official documentation at API Reference - Dumpling AI Docs


For those comfortable with regular expressions, traditional web scraping using the "Make a request" and "Match Pattern" modules allows for precise control over data extraction. However, this method can become complex when dealing with multiple data points. Hosted web scraping services like ScrapeNinja offer a more user-friendly approach with jQuery-like selectors and the capability to handle anti-scraping measures. AI-powered web scraping with tools like Dumpling AI provides the simplest and fastest setup, requiring only a description of the desired content for extraction. While this method offers great ease of use, it may provide less granular control over specific data points.


 
Posted : 23/08/2024 5:51 am
samliew
(@samliew)
Posts: 293
Reputable Member
Topic starter
 

Here is more information about the Dumpling AI integration in callin.io.

AI Agents

AI agents are pretrained on your data and knowledge base for RAG (Retrieval-Augmented Generation). You can set one up in the dashboard and then call it with the Dumpling AI “Generate AI Agent Completion” module:

Runs AI Agent completion and returns the result

For more information, see the official documentation at Build Custom AI Agents, Simply.

(source: Dumpling AI website)

Run JavaScript (with plugins)

If you need to run JavaScript/TypeScript with JS libraries (NPM packages) in your scenario, you can consider Dumpling AI’s “JavaScript Code Execution API” available via the “Run Javascript Code” module —

Run your javascript or typescript code and get the result back.

The official documentation on how to use NPM modules with this module can be found here.

Dumpling AI also does much more. See also:

Examples of How to use Dumpling AI

For more information, see these Dumpling AI tutorials below, grouped by category:

YouTube & Videos

Image Generation

AI Agents & RAGs

Searching & Scraping

Other Data Extraction

Business & Social

Dumpling AI Tutorials

In short, Dumpling AI can replace several other paid services that together would cost more than it does, making it a noteworthy choice as the “multi-tool” of AI services.

How to Use

For more information on how to set this up, refer to these forum threads:


 
Posted : 23/08/2024 6:11 am
system
(@system)
Posts: 332
Reputable Member
 

This discussion was automatically closed after 29 days. New responses are no longer permitted.

 
Posted : 22/09/2024 5:51 am