Need help manipulating text from the HTML Extract module

10 Posts
2 Users
0 Reactions
3 Views
automatron
(@automatron)
Posts: 5
Active Member
Topic starter
 

I'm looking to extract a specific text string from my HTML Extract module's output.

The HTML Extract module has successfully retrieved the following text:

Condition: Used: An item that has been previously used. See the seller’s listing for full details and description of any imperfections. See all condition definitions– opens in a new window or tab ... Read moreabout the condition Year: 2002 Mileage: 42050

Currently, I'm sending this entire description to Airtable in the subsequent module. Instead, I'd like to insert a step before sending to Airtable where, using a JavaScript function or another suitable module, I can process the text. Specifically:

I want to extract the year and mileage from this text and store them as data points. These will then be sent to their respective columns (also named “age” and “mileage” in Airtable), along with the other fields extracted by my HTML Extract module.

Does anyone have any suggestions on how to achieve this?

Here's the code for my HTML Extract Module:

 
Posted : 22/12/2020 5:40 am
harshil1712
(@harshil1712)
Posts: 20
Eminent Member
 

Hello!

Below is an example Set node that might be helpful. The key part is the snippet .match(/[0-9]\d{3}/).toString() for the year and .match(/[0-9]\d{4}/).toString() for the mileage. Each snippet extracts the first run of four (or five) digits from the text and converts the match into a string. When you reference these values in the Set node, use these snippets to pull out the required data.
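As a rough illustration of what those two snippets do outside of a workflow, here is a minimal sketch in plain JavaScript. The `specifics` string is a shortened copy of the text from the first post, and the variable names are only assumptions for the example.

```javascript
// Shortened copy of the sample text from the first post (assumption for this sketch).
const specifics =
  "Condition: Used: An item that has been previously used. Year: 2002 Mileage: 42050";

// [0-9] followed by \d{3}: the first run of four digits in the text.
const year = specifics.match(/[0-9]\d{3}/).toString();

// [0-9] followed by \d{4}: the first run of five digits in the text.
const mileage = specifics.match(/[0-9]\d{4}/).toString();

console.log(year);    // 2002
console.log(mileage); // 42050
```

Note that both patterns rely only on the number of digits, not on the "Year:" or "Mileage:" labels.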

Hope this is helpful!

:slightly_smiling_face:

 
Posted : 22/12/2020 6:09 am
automatron
(@automatron)
Posts: 5
Active Member
Topic starter
 

Hello Harshil, this is excellent, thank you! I had to make a small adjustment but got it working with:
{{$node["HTML Extract"].json["specifics"].match(/[0-9]\d{3 or 4}/).toString()}}

I recognize .match() and .toString() as JavaScript functions and plan to learn more about them on W3C. Is the “/[0-9]\d{3 or 4}/” part a regular expression? Do you have any recommended reading or tutorials to understand it better?

I'm very impressed with callin.io and the community. I've been following for a while and have seen your helpful responses on many threads - thank you!

:slight_smile:

 
Posted : 22/12/2020 6:43 am
harshil1712
(@harshil1712)
Posts: 20
Eminent Member
 

I'm glad to hear it's working!

Yes, .match() and .toString() are indeed JavaScript functions. Since .match() returns an array of matched items, we use .toString() to convert that array into a string.

I typically use Regex 101 for constructing and testing regular expressions. Additionally, I consult the MDN documentation for Regex.

Thank you for your positive feedback!

:slightly_smiling_face:

 
Posted : 22/12/2020 6:50 am
automatron
(@automatron)
Posts: 5
Active Member
Topic starter
 

Thanks, Regex101 looks very useful. I've tested your regex there. Could you explain why you're matching on "\d"? My understanding was that you'd be looking for the words "Year" or "Mileage" and capturing the digits that follow them, but your code doesn't seem to be doing that.

If I had a different data set where the mileage was simply 100 (instead of 42050), would the same code still function correctly?

 
Posted : 22/12/2020 6:59 am
harshil1712
(@harshil1712)
Posts: 20
Eminent Member
 

The \d checks for digits. For Mileage, this might not work if the length isn't fixed. The approach you're proposing seems more logical.
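To make the fixed-length problem concrete, here is a small sketch using a made-up listing where the mileage is 100. The input string and variable names are assumptions for illustration only.

```javascript
// Hypothetical listing text with a three-digit mileage.
const shortListing = "Year: 2002 Mileage: 100";

// The fixed-length pattern needs five digits in a row, so nothing matches.
const fixed = shortListing.match(/[0-9]\d{4}/);
console.log(fixed); // null (calling .toString() on null would throw a TypeError)

// Anchoring on the label and allowing any number of digits still works.
const labelled = shortListing.match(/Mileage:\s*(\d+)/);
console.log(labelled[1]); // 100
```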

 
Posted : 22/12/2020 7:16 am
automatron
(@automatron)
Posts: 5
Active Member
Topic starter
 

Thanks Harshil, this is all incredibly helpful and I'm learning a lot. I've just finished some basic regex tutorials and understand that I need the following expression to capture the mileage:

{{$node["HTML Extract"].json["specifics"].match(/Mileage:\s*(\d+)/).toString()}}

I'm using the parentheses around \d+ to "capture" only the numbers following the word "Mileage:". This should work in theory, but when I use it in the Set node, it yields the following result:

Mileage:

A quick tip for anyone using the `match` method in JavaScript: when you use parentheses in your regex to capture specific groups, the `match()` function returns an array. The first element of this array is the full match, and subsequent elements are the captured groups. To get just the captured number, access the second element of the array (index 1) instead of calling `.toString()`. For example, if your regex is `/Mileage:\s*(\d+)/` and the return value of `match()` is stored in `result`, you would use `result[1]` to get the captured number.

This should resolve the issue of getting the full string along with the captured number. Let me know if you need further assistance!
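A minimal sketch of that difference, assuming a shortened version of the listing text (the variable names are only for illustration):

```javascript
// Shortened sample text (assumption for this sketch).
const specifics = "Year: 2002 Mileage: 42050";
const result = specifics.match(/Mileage:\s*(\d+)/);

console.log(result[0]);         // Mileage: 42050 (the full match)
console.log(result[1]);         // 42050 (the first captured group)

// .toString() joins every array element with a comma, label included.
console.log(result.toString()); // Mileage: 42050,42050
```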

 
Posted : 22/12/2020 8:15 am
harshil1712
(@harshil1712)
Posts: 20
Eminent Member
 

Yes, the .match() method will indeed return the matches. Since we're also searching for the word "Mileage", it gets included in the results. This is a scenario where you might consider using the .replace() method to substitute (or remove, in this instance) specific data with the desired information. I hope this documentation proves helpful.
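One way the .replace() idea could look, as a sketch only: take the full match and strip the label from it. The sample string and variable names are assumptions.

```javascript
// Shortened sample text (assumption for this sketch).
const specifics = "Year: 2002 Mileage: 42050";

// Take the full match ("Mileage: 42050"), then remove the label with .replace().
const fullMatch = specifics.match(/Mileage:\s*\d+/)[0];
const mileage = fullMatch.replace(/Mileage:\s*/, "");

console.log(mileage); // 42050
```

Reading the captured group at index 1 gets the same value in one step, so this is mostly useful when a capture group is not available.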

 
Posted : 22/12/2020 9:44 am
automatron
(@automatron)
Posts: 5
Active Member
Topic starter
 

For completeness, here is the final solution that worked for me. I had to use 2 nodes but maybe it is possible to do it with just 1 for someone who is more proficient in Javascript/callin.io.

  1. Function node:
items[0].json.Mileage = $node["HTML Extract"].json["specifics"].match(/Mileage:\s*(\d+)/);
items[0].json.Year = $node["HTML Extract"].json["specifics"].match(/Year:\s*(\d+)/);
items[0].json.Colour = $node["HTML Extract"].json["specifics"].match(/Colour:\s*(\w*)/);
return items;

The above RegEx works in my case to grab numbers and words that come after a certain identifier like “Mileage:” or “Colour:”, skipping any whitespace in between.

  2. Set node. Create a new string to set for each of the variables above with an expression like this:
    {{$node["Function"].json["Colour"][1].toString()}}
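For anyone wondering whether the two steps could be folded into one, here is a hedged sketch in plain JavaScript that reads the captured group straight away. The helper names `capture` and `extractSpecifics` and the sample text (including the "Colour: Silver" part) are hypothetical, and this has not been tested inside a Function node.

```javascript
// Hypothetical helper: returns the first captured group, or null when the label is absent.
function capture(text, pattern) {
  const match = text.match(pattern);
  return match ? match[1] : null;
}

// Hypothetical helper: collects all three fields as plain strings.
function extractSpecifics(text) {
  return {
    Mileage: capture(text, /Mileage:\s*(\d+)/),
    Year: capture(text, /Year:\s*(\d+)/),
    Colour: capture(text, /Colour:\s*(\w*)/),
  };
}

// Made-up sample text for this sketch.
const sample = "Year: 2002 Mileage: 42050 Colour: Silver";
console.log(extractSpecifics(sample));
```

Inside the Function node, the returned values could presumably be assigned to items[0].json in the same way as above, which would leave the Set node with nothing to unpack. The null check also avoids an error when a listing has no "Colour:" entry.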

Thank you for the pointers and tips along the way.

:smile:

 
Posted : 23/12/2020 4:23 pm
harshil1712
(@harshil1712)
Posts: 20
Eminent Member
 

This is great! Have fun!

 
Posted : 24/12/2020 3:51 am