This is lib-text, a small text processing library. It supports language identification, tokenization, and stopword filtering, and provides some useful helper functions. The tokenization has been tuned to work well with text conventions common on social media such as Twitter, and handles URLs, emoji, hashtags, emails and @-mentions cleanly. Stopword filtering is currently supported for
With more to come.
Add to your build.sbt file:
resolvers += "peoplepattern" at "https://dl.bintray.com/peoplepattern/maven/"
libraryDependencies += "com.peoplepattern" %% "lib-text" % "0.3"
import com.peoplepattern.text.Implicits._
val txt = "Did you get your personalised print with your copy of #MadeintheAM on Black Friday? If not, there's still time! http://www.myplaydirect.com/one-direction"
txt.lang
// Some(en)
txt.tokens
// Vector(Did, you, get, your, personalised, print, with, your, copy, of, #MadeintheAM, on, Black, Friday, ?, If, not, ,, there's, still, time, !, http://www.myplaydirect.com/one-direction)
txt.terms
// Set(print, personalised, black, copy, friday, time)
txt.termsPlus
// Set(print, personalised, black, #madeintheam, copy, friday, time)
txt.termBigrams
// Set(black friday, personalised print)
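To make the relationship between the token output and the bigram output above easier to see, here is a minimal standalone sketch in plain Scala. It does not use lib-text and is not the library's implementation: the object `BigramSketch`, the tiny `stopwords` set, and the `isTerm` rule (alphabetic, non-stopword tokens only) are all assumptions chosen so that the example text behaves as shown above. The real stopword lists and term rules may differ.

```scala
object BigramSketch {
  // Hypothetical, deliberately tiny stopword list for illustration only;
  // this is NOT lib-text's actual English stopword list.
  val stopwords: Set[String] =
    Set("did", "you", "get", "your", "with", "of", "on", "if", "not", "still")

  // Assumed rule: a term is a purely alphabetic token that is not a stopword.
  // This excludes punctuation, hashtags, URLs and contractions like "there's".
  def isTerm(token: String): Boolean =
    token.nonEmpty && token.forall(_.isLetter) && !stopwords(token)

  // Lowercase the tokens, then keep each adjacent pair in which both
  // tokens qualify as terms, joined by a single space.
  def termBigrams(tokens: Seq[String]): Set[String] = {
    val lower = tokens.map(_.toLowerCase)
    lower.sliding(2).collect {
      case Seq(a, b) if isTerm(a) && isTerm(b) => s"$a $b"
    }.toSet
  }
}
```

Under these assumptions, passing the token vector from the example above to `BigramSketch.termBigrams` gives a set containing exactly "personalised print" and "black friday". Pairs such as "still time" are dropped only because "still" is in the assumed stopword list.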
Full API docs are available for each published version:
Developed with ❤️ at People Pattern Corporation