Hello everyone,
I'm facing an issue with the Scrape Ninja module within callin.io. When I utilize Scrape Ninja (Real browser) on certain websites, it fails to capture all the data present on the page. In contrast, when I switch to using Dumpling AI, it successfully scrapes all the required data.
I'd prefer to use Scrape Ninja, so how can I resolve this?
Welcome to the callin.io community!
You need to utilize the Extractor function from ScrapeNinja to correctly extract data from web pages. This is an advanced technique, and if you are not familiar with JavaScript, it's recommended to continue using Dumpling AI instead.
You can test your extractor function in the ScrapeNinja Live Sandbox before using it in a scenario.
Essentially, you need to "visit" the site yourself to obtain the content. This process is known as Web Scraping.
Incomplete Scraping
Are you receiving no output from the Text Parser's “HTML to Text” module? This occurs because there is no text content in the HTML. All of the page content you are scraping is stored as data inside a script tag, and it is turned into page elements by JavaScript only when the page is loaded and executed in the user's web browser (client-side). callin.io is a server-side runtime environment, so the HTTP modules return only the script tags. The Text Parser “HTML to Text” module ignores these script tags because they are not HTML layout elements.
callin.io's HTTP “Make a request” module does not execute any of this JavaScript. Consequently, the page has no content other than a default message prompting you to enable JavaScript.
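To make the point above concrete, here is a minimal Python sketch. It is only an illustration, not how callin.io or the Text Parser is implemented: the sample page, the "page-data" id, and the helper names are all made up. It shows that a plain HTML-to-text pass sees only the "enable JavaScript" notice, while the actual content sits untouched inside the script tag.

```python
import json
from html.parser import HTMLParser

# Hypothetical page, invented for illustration: the only visible markup is a
# <noscript> notice; the real content sits as JSON inside a <script> tag.
SAMPLE_HTML = """
<html>
  <body>
    <noscript>Please enable JavaScript to view this page.</noscript>
    <div id="app"></div>
    <script id="page-data" type="application/json">
      {"title": "Example article", "body": "Full article text lives here."}
    </script>
  </body>
</html>
"""


class VisibleTextParser(HTMLParser):
    """Separates visible text from <script>/<style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0    # >0 while inside <script> or <style>
        self.text_parts = []    # what an HTML-to-text step would keep
        self.script_parts = []  # what it would throw away

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            self.script_parts.append(data)
        elif data.strip():
            self.text_parts.append(data.strip())


def split_html(html):
    """Return (visible_text, raw_script_content) for an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), "".join(parser.script_parts)


visible_text, script_payload = split_html(SAMPLE_HTML)
page_data = json.loads(script_payload)

print(visible_text)        # only the noscript notice
print(page_data["title"])  # the content was inside the script tag all along
```

On a real site the script content is usually not this tidy, which is why a real browser that runs the page's JavaScript is the more dependable route.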
This is not an issue or bug with the callin.io platform, the Text Parser, or Regular Expressions.
You CANNOT use standard scraping integrations like ScrapingBee or the HTTP “Make a request” module to fetch this page's structure.
You will need to use ScrapeNinja's “Scrape (Real browser)” module to emulate a real user visiting the site using a web browser. This is necessary because client-side JavaScript needs to run to parse the JSON data within the script tags and generate the page structure and content.
For additional information and a demonstration using ScrapeNinja, please refer to Scraping Bee Integration Runtime Error 400
Web Scraping
For web scraping, you can use a service such as ScrapeNinja to retrieve content from a page.
ScrapeNinja lets you extract content from elements with jQuery-like selectors by writing an extractor function. ScrapeNinja can also load the page in a real web browser, fetching all content and executing the page's scripts, so the result closely matches what you see when you visit the page yourself, rather than just the raw page HTML fetched by the HTTP module.
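As a rough illustration of what selector-based extraction means, here is a small Python analogue. Please note the assumptions: ScrapeNinja extractor functions are written in JavaScript, so this is not ScrapeNinja's API, and the sample HTML, the class names, and the helpers select_text and extract are hypothetical. The matcher only understands a bare tag name or a single ".class" and assumes every tag is properly closed.

```python
from html.parser import HTMLParser

# Hypothetical rendered HTML, as a real browser might return it after the
# page's scripts have run. Class names are invented for this example.
RENDERED_HTML = """
<article>
  <h1 class="headline">Example headline</h1>
  <div class="article-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
"""


class SelectorText(HTMLParser):
    """Collects the text of every element matching 'tag' or '.class'.

    Limitation: assumes well-formed markup where every tag is closed.
    """

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.depth = 0      # >0 while inside a matching element
        self.current = []   # text pieces of the element being read
        self.results = []   # one string per matching element

    def _matches(self, tag, attrs):
        if self.selector.startswith("."):
            classes = (dict(attrs).get("class") or "").split()
            return self.selector[1:] in classes
        return tag == self.selector

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1          # nested tag inside a match
        elif self._matches(tag, attrs):
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:      # the matching element just closed
                self.results.append(" ".join(self.current))

    def handle_data(self, data):
        if self.depth and data.strip():
            self.current.append(data.strip())


def select_text(html, selector):
    """Return the text of each element matching the selector."""
    parser = SelectorText(selector)
    parser.feed(html)
    parser.close()
    return parser.results


def extract(html):
    """Plays the role of an extractor: HTML in, structured data out."""
    return {
        "title": select_text(html, "h1")[0],
        "body": select_text(html, ".article-body")[0],
    }


result = extract(RENDERED_HTML)
print(result)
```

The idea carries over to a real extractor: you inspect the page, pick selectors for the elements you care about, and return only those fields instead of the whole HTML.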
If you'd like an example, please check out Grab data from page and url - #5 by samliew
AI-powered “easier” method
You can also leverage AI-powered web scraping tools such as Dumpling AI.
This is likely the most straightforward and rapid method to set up, as it only requires you to describe the content you need, rather than inspecting elements to create selectors or devising regular expression patterns.
The advantage of this approach is that such services integrate both fetching and extracting data within a single module (saving operations) and eliminate the lengthy setup required by other methods.
More information, other methods
For further details on various web scraping techniques, consult Overview of Different Web Scraping Techniques in Make 🌐
Hope this proves helpful! Please let me know if you have any further questions or encounter any issues.
— @samliew
P.S.: Investing some time in the callin.io Academy can significantly reduce the time and frustration you might experience using callin.io.
Thank you for your previous help! I'm contacting you regarding a technical challenge I've run into while using ScrapeNinja's real browser mode. I experimented with Dumpling AI. It appears excellent for scraping challenging websites, but it's currently beyond my budget.
Issue:
When I try to scrape this article with ScrapeNinja (Real browser), the tool returns incomplete HTML. Example: https://thehill.com/homenews/administration/5181338-ssa-bans-general-news-websites/
The basic structure loads, but essential data (such as the article body and dynamically loaded elements) is missing.
How can I get the full content?
This is due to anti-scraping measures on the site that prevent it from being scraped reliably.
My suggestion is that if you are unable to get ScrapeNinja to function, consider using an alternative web scraping service or finding someone to resolve the issue.
You can find other options in the links provided in my prior post or on my profile.
You can also post in the Hire a Pro category to request personalized 1-to-1 assistance through video calls, screenshares, private messaging, and more. This can help resolve your issue faster, particularly if it is urgent or involves sensitive information. It is important to submit your request in the Hire a Pro category, because forum members are prohibited from advertising their services in other sections, even when the help is free. Posting there allows other members to assist you through alternative communication channels.
Hope this helps!
Thanks! If you're aware of other tools, please let me know. I appreciate it!
I discovered a service named firecrawl.dev, and it's functioning flawlessly.