Hello everyone,
I'm trying to set up a scenario and have encountered some difficulties. I was hoping someone could offer some assistance.
The workflow I'm aiming for is as follows:
- Fetch new documents from the "Thư viện pháp luật" (https://thuvienphapluat.vn/) website.
- Filter these documents to include only those pertaining to science and technology, innovation, and digital transformation.
- Summarize the key details (title, date, and content) of each post.
- Log this summarized information into a Google Sheet.
I believe the most challenging aspect is automating the access to specific news articles and extracting the relevant data. Any help or guidance would be greatly appreciated. Thank you all very much.
Welcome to the callin.io community!
So you essentially need to “visit” the site yourself to retrieve the content. This process is known as Web Scraping.
Incomplete Scraping
Are you receiving NO output from the Text Parser “HTML to Text” module? That is because there is NO text content within the HTML! The entire page content you are scraping lives inside a script tag, and is only generated and placed onto the page when that JavaScript is loaded and executed in the user’s web browser on the client side. callin.io is a server-side runtime environment, so the HTTP modules return only the script tags, and those are ignored by the Text Parser “HTML to Text” module because a script tag is NOT an HTML layout element.
The callin.io HTTP “Make a request” module does NOT execute any of those JavaScript scripts, so the fetched page contains no content other than a default message prompting you to enable JavaScript.
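To illustrate, the raw HTML returned to the HTTP module typically looks something like this (a simplified, hypothetical sketch; the actual tag names and embedded JSON on that site will differ):

```
<body>
  <noscript>Please enable JavaScript to view this page.</noscript>
  <!-- The visible text exists only as JSON data inside a script tag,
       not as HTML layout elements the Text Parser can read -->
  <script type="application/json">
    {"articles":[{"title":"...","date":"...","content":"..."}]}
  </script>
  <script src="/app.js"></script>
</body>
```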
This is NOT an issue/bug with the callin.io platform, the Text Parser, or Regular Expressions.
You CANNOT use standard scraping integrations like ScrapingBee or the HTTP “Make a request” module to fetch this page’s structure.
You will need to utilize ScrapeNinja’s “Scrape (Real browser)” module to emulate a real user visiting the site using a web browser, as client-side JavaScript needs to run to parse the JSON data within the script tags and generate the page structure and content.
For additional information and a demonstration using ScrapeNinja, please refer to Scraping Bee Integration Runtime Error 400
Web Scraping
For web scraping, one service you can use to obtain content from the page is ScrapeNinja.
ScrapeNinja allows you to employ jQuery-like selectors to extract content from elements by using an extractor function. ScrapeNinja can also run the page in a real web browser, loading all content and executing page load scripts, thus closely simulating what you see, as opposed to just the raw page HTML fetched via the HTTP module.
If you would like an example, please examine Grab data from page and url - #5 by samliew
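To give you a rough idea, below is a minimal extractor function sketch in the jQuery-like (cheerio) style that ScrapeNinja uses. All the selectors in it ('.article-item', 'a.title', '.date') are hypothetical placeholders; you would need to inspect the actual page to find the real ones:

```
// Minimal ScrapeNinja extractor sketch; all selectors below are
// hypothetical placeholders, NOT the real ones from thuvienphapluat.vn
function extract(input, cheerio) {
  // Load the fetched HTML for jQuery-like querying
  const $ = cheerio.load(input);
  const items = [];

  // Collect the title, link, and date of each listed document
  $('.article-item').each((i, el) => {
    const link = $(el).find('a.title');
    items.push({
      title: link.text().trim(),
      url: link.attr('href'),
      date: $(el).find('.date').text().trim(),
    });
  });

  return { items };
}
```

Each object in the returned items array can then be mapped to a row in your Google Sheet using the Google Sheets “Add a Row” module.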
AI-powered “easier” method
You can also leverage AI-powered web scraping tools such as Dumpling AI.
This is likely the most straightforward and rapid method to set up, as all you need to do is describe the content you require, rather than inspecting elements to create selectors or devising regular expression patterns.
The advantage here is that such services combine BOTH fetching and extracting the data within a single module (conserving operations), and eliminate the lengthy setup required by other methods.
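For example, you could simply describe what you listed above, along the lines of (hypothetical prompt wording): “Extract the title, publication date, and a short summary of each new article about science and technology, innovation, or digital transformation.”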
More information, other methods
For further details on various web scraping techniques, please consult Overview of Different Web Scraping Techniques in Make 🌐
If you require additional assistance, kindly provide the following:
1. Relevant Screenshots
Could you please share screenshots of your entire scenario? Also, include screenshots of any error messages, module configurations (fields), relevant filter settings (conditions), and module output bundles. We need to see your setup to offer the best guidance.
You can upload images here using the Upload icon in the text editor.
We would appreciate it if you could upload screenshots directly here rather than linking to them externally. This enables us to zoom in on images when clicked and prevents tracking cookies from third-party websites.
2. Scenario Blueprint
Please export the scenario blueprint. Providing your scenario blueprint file will allow others to quickly replicate and view your mappings in each module. It also enables us to share screenshots or module exports of any solutions we have for you in return – this will greatly benefit you in implementing our suggestions as you can simply paste module exports back into your scenario editor!
To export your scenario blueprint, click the three dots at the bottom of the editor and select ‘Export Blueprint’.
You can upload the file here using the Upload icon in the text editor.
3. Output Bundles of Modules
Please provide the output bundles of each relevant module by running the scenario (you can also retrieve these from the History tab without re-running your scenario).
Click on the white speech bubbles located at the top-right of each module and select “Download input/output bundles”.
A. Upload as a Text File
Save the contents of each bundle in a plain text editor (without formatting) as a bundle.txt file.
You can upload the file here using the Upload icon in the text editor.
B. Insert as Formatted Code Block
If you are unable to upload files on this forum, you can alternatively paste the formatted bundles.
These are the two methods for formatting text so it won’t be altered by the forum:
- Method 1: Type the code block manually

  Add three backticks (```) before and after the content/bundle, like this:

  ```
  content goes here
  ```

- Method 2: Highlight the content and click the format button in the editor
Providing the input/output bundles will enable others to replicate the scenario's behavior, especially with complex data structures (nested arrays and collections) or when external services are involved. It also aids in mapping raw property names from collections.
Sharing these details will facilitate assistance from others.
Hope this helps! Let me know if there are any further questions or issues.
— @samliew
P.S.: Investing some effort into the callin.io Academy will save you considerable time and frustration when using callin.io.