4/27/2023

Jaccard similarity python

Finding near-duplicate texts is a hard problem, but the Jaccard index for n-grams is an effective measure that's efficient on small sets. I've tried it on the Adzuna Job Salary Predictions Kaggle Competition with good success. This works pretty well at finding near-duplicates and even ads from the same company, although by itself it can't detect duplicates.

I've looked before at using the edit distance, which looks for the minimum number of changes needed to transform one text into another, but it's slow to calculate. Instead we treat each document as a bag of n-grams (for some fixed n) and calculate the Jaccard index between them: the number of n-grams the two documents share divided by the number of distinct n-grams that appear in either. So two documents will be similar if they contain the same phrases of length n tokens, irrespective of order.

How do we pick n? By the shingle inequality, as we increase n the values will decrease, but it's not clear at what value the data should be separated. For n = 1 it's just common words, and this is unlikely to be very separating (different documents can contain common words); however, it's unlikely that two documents having lots of phrases of length 5 or 6 in common is a coincidence. On the other hand, errors in the data (e.g. punctuation or case being changed in one copy) can break otherwise long identical sequences of tokens, which means you don't want to set n too big.

For the ads I took a random sample of 2000 ads (producing just under 2 million pairs) and looked at the distribution of the Jaccard values (with log frequency).

At a glance it seems like the Jaccard index on 4-grams was pretty effective at separating out unrelated jobs from jobs that had a common origin. It's a bit hard to work out where exactly to draw the line, but somewhere between 0.2 and 0.5 seems about right. Because few ads actually come from the same origin, this is a pretty effective way of picking the few similar ads from a sea of different ads. Then further analysis can be done on these ads to determine whether they are actually similar. Note that the density will grow sparser as we add more ads because there will be fewer near-duplicate pairs relative to the number of non-duplicate pairs.
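
As a minimal sketch of the approach described in this post, the Python below computes the Jaccard index on token n-grams and lists the pairs above a cutoff. The function names (`tokenize`, `ngrams`, `jaccard`, `similar_pairs`), the regex tokenizer, and the sample ads are all assumptions made for illustration, not the code I ran on the Adzuna data. The default threshold of 0.3 is just one value picked from the 0.2 to 0.5 range.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercase and keep runs of letters/digits (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return the set of contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; treated as 0.0 when both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_pairs(docs, n=4, threshold=0.3):
    """Compare every pair of documents and keep those scoring at or above threshold."""
    shingles = [ngrams(tokenize(doc), n) for doc in docs]
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        score = jaccard(shingles[i], shingles[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Made-up ads for illustration only.
ads = [
    "We are looking for a senior data analyst to join our growing team in London",
    "We are looking for a senior data analyst to join our growing team in Leeds",
    "Experienced chef required for a busy city centre restaurant",
]

for i, j, score in similar_pairs(ads):
    print(i, j, round(score, 3))  # prints: 0 1 0.846
```

Lowercasing and stripping punctuation in the tokenizer is one way to soften the data errors mentioned above, so a changed comma or capital letter does not break a long shared run of tokens. Comparing every pair is quadratic in the number of documents, which is fine for a sample of a couple of thousand ads but is why this only counts as efficient on small sets.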