
Fuzzy matching email addresses (or LinkedIn profiles, etc.)

I really like being able to search on a few words and see the matching results from one or more files.

One line per result.

 

I don't want to spend hours copying and pasting text, looking for results, and getting those results back into a CRM or a CSV/Excel file.

So automation!

Each data file in Ninox has its own URL.

Open that URL in Octoparse and see what is possible.

 

And I am really not familiar with SQL, Python, R, pandas or other tooling.

So Octoparse was a good choice for me.

 

 

I merged the last name and domain columns into a new column.

I use the text from this new column as the search term in Ninox.

Results are matched on the value in this new column.

 

An example for matching:

Lastname: Staaks

Domain: vandaatselaar.nl
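
For anyone who does want to script the merge step, here is a minimal sketch in Python. I did this in a spreadsheet-style column merge, so the function name, the dictionary keys and the space used as separator are all assumptions for illustration, not what Ninox or Octoparse require.

```python
def build_search_key(last_name: str, domain: str) -> str:
    """Merge last name and domain into one search string.

    The single-space separator is an assumption; use whatever
    separator your merged column actually has.
    """
    parts = [last_name.strip(), domain.strip().lower()]
    # Skip empty parts so a missing domain does not leave a trailing space.
    return " ".join(p for p in parts if p)


# Hypothetical rows; the second one has no domain.
rows = [
    {"lastname": "Staaks", "domain": "vandaatselaar.nl"},
    {"lastname": "Staaks", "domain": ""},
]

keys = [build_search_key(r["lastname"], r["domain"]) for r in rows]
print(keys)
```

A row without a domain ends up as a last name only, which is worth knowing because such keys match far more loosely.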

 

I got my email addresses by running a web crawler combined with a RegEx extraction for email addresses.

 

The email addresses found were deduplicated, and generic addresses such as info@ were filtered out.
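
The extract, dedupe and filter steps could look roughly like the sketch below. The regular expression is a common simplified pattern, not the one from my crawler, and the filter set only contains info because that is the only address type named here. The sample text uses made-up example.com addresses.

```python
import re

# Simplified email pattern (an assumption, not the crawler's own RegEx).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Local parts to drop; extend this set to suit your own data.
GENERIC_LOCAL_PARTS = {"info"}


def extract_emails(text: str) -> list[str]:
    """Return unique, lower-cased addresses in order of first appearance."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        local_part = email.split("@", 1)[0]
        if local_part in GENERIC_LOCAL_PARTS or email in seen:
            continue  # skip generic mailboxes and duplicates
        seen.add(email)
        result.append(email)
    return result


page = (
    "Contact j.doe@example.com or info@example.com. "
    "Also J.Doe@example.com and a.smith@example.org"
)
emails = extract_emails(page)
print(emails)
```

Lower-casing before the duplicate check means J.Doe@ and j.doe@ count as the same address.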

 

I moved the results to Ninox.

 

I used Octoparse to get the results into a new file.

The results are later joined into the CSV (database).
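
As a rough idea of what that join can look like outside a spreadsheet, here is a small sketch using only Python's standard csv module. The column names (search_text, found_email) and the sample rows are hypothetical; the only thing taken from my setup is that the merged search text is the key both files share.

```python
import csv
import io


def join_results(contacts, results, key="search_text"):
    """Left join: add the found email to every contact with a matching key."""
    found = {}
    for row in results:
        # Keep the first hit per key; later duplicates are ignored.
        found.setdefault(row[key], row["found_email"])
    return [{**c, "found_email": found.get(c[key], "")} for c in contacts]


# In-memory stand-ins for the original CSV and the Octoparse export.
contacts_csv = (
    "lastname,domain,search_text\n"
    "Doe,example.com,Doe example.com\n"
    "Smith,example.org,Smith example.org\n"
)
results_csv = (
    "search_text,found_email\n"
    "Doe example.com,j.doe@example.com\n"
)

contacts = list(csv.DictReader(io.StringIO(contacts_csv)))
results = list(csv.DictReader(io.StringIO(results_csv)))
joined = join_results(contacts, results)

for row in joined:
    print(row)
```

Contacts without a hit keep an empty found_email value, so no rows from the original file are lost.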

 

You can use Octoparse for free.

They have a free plan.

The export is limited to 10,000 rows of data.

https://www.octoparse.com/pricing

 

You can use the URL of the dataset in Ninox.

Put this URL in Octoparse.

 

 

You might be prompted to log in.

Do so in 'browse mode'.

Use the cookie from the login.

 

Enter text in the loop.

I entered about 9,100 lines of text from the column with the merged last name and domain name.

Extract the data you want.

 

I used these fields for extraction:

search text

found email address

domain name

 

Use a click item to 'reset' the search bar.

If you don't do this, entering new text fails and you won't get any new results.

This is the XPath:

//div[@class="hud-menu-search-placeholder"]/div[1]

 

 

When you want to run it, you can choose task split in 'Task Settings'.

 

You can run locally or in the cloud (paid version only).

https://helpcenter.octoparse.com/en/articles/6470987-run-task-locally-with-octoparse-8-5

Boost Mode is about 3 times faster than Standard Mode.

 

Some found results.

 

 

It would have been better to search only on text containing both a last name AND a domain name.

This time I just used everything (including rows with only a last name and no domain name).

 

9,174 searches took 1h 31m in Boost Mode

and 4h 40m in Normal Mode.

 

But....

in Normal Mode: 2,303 results

and in Boost Mode: 1,511 results
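
To put those numbers side by side, here is a quick calculation in Python using only the figures from my two runs. The variable names are my own; nothing here comes from Octoparse.

```python
searches = 9174

boost_minutes = 1 * 60 + 31    # 1h 31m
normal_minutes = 4 * 60 + 40   # 4h 40m
speedup = normal_minutes / boost_minutes

normal_hits = 2303
boost_hits = 1511
missing = normal_hits - boost_hits  # results Boost Mode did not return

normal_rate = normal_hits / searches
boost_rate = boost_hits / searches

print(f"speedup: {speedup:.2f}x")
print(f"hit rate normal: {normal_rate:.1%}, boost: {boost_rate:.1%}")
print(f"missing in boost: {missing} ({missing / normal_hits:.1%} of normal hits)")
```

So the faster run saves about three hours but returns roughly a third fewer results, which is a trade-off to check before relying on it.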

 

Boost Mode uses 3 parallel tasks.

It splits up the search text itself.

 

 

This week I will test fuzzy matching on LinkedIn profiles.
