This is an R package for Chinese text segmentation, keyword extraction, and part-of-speech tagging.

Features

Example

Text Segmentation

You can use worker() to initialize a worker, and then use [] or segment() to do the segmentation.

library(jiebaR)

##  Using default settings to initialize a worker.
cutter = worker()

###  Note: Chinese characters cannot be displayed here.

segment( "This is a good day!" , cutter )
## [1] "This" "is"   "a"    "good" "day"
## OR cutter["This is a good day!"]

You can also pass a file path as input.

segment( "./temp.dat" , cutter ) ### Auto encoding detection.
## [1] "temp" "dat"

You can initialize multiple engines simultaneously.

cutter2 = worker(type   = "mix",
                 dict   = "some_path/jieba.dict.utf8",
                 hmm    = "some_path/hmm_model.utf8",
                 user   = "some_path/test.dict.utf8",
                 detect = TRUE,  symbol = FALSE,
                 lines  = 1e+05, output = NULL
                 )
cutter2   ### Print information of worker
Worker Type:  Mix Segment

Detect Encoding :  TRUE
Default Encoding:  UTF-8
Keep Symbols    :  FALSE
Output Path     :  
Write File      :  TRUE
Max Read Lines  :  1e+05

Fixed Model Components:  

$dict
[1] "dict/jieba.dict.utf8"

$hmm
[1] "dict/hmm_model.utf8"

$user
[1] "dict/test.dict.utf8"

The settings $detect, $encoding, $symbol, $output, $write, and $lines can be reset.

The public settings of the model can be modified with $, for example cutter$symbol = T. Private settings are fixed when the engine is initialized, and you can get them with cutter$PrivateVarible.

cutter$encoding
## [1] "UTF-8"
cutter$detect
## [1] TRUE
cutter$detect = F
cutter$detect
## [1] FALSE

You can use a custom dictionary. jiebaR is able to identify new words on its own, but adding your own words ensures higher accuracy. imewlconverter is a good tool for dictionary construction.

show_dictpath() ### Show path
## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/jiebaRD/dict"
?edit_dict()   ### For more information
## Using development documentation for edit_dict

Part-of-Speech Tagging

The part-of-speech tagging functions [.tagger and tagging() tag each word in a sentence after segmentation, using labels compatible with ICTCLAS.

words = "hello world"
tagger = worker("tag")
tagger[words]
##     eng     eng 
## "hello" "world"
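
As the output above shows, the tagger returns a named character vector: the values are the words and the names are the tags. The following is a minimal base R sketch of filtering and counting words by tag. It uses a hand-built vector in place of real tagger output so that it runs without jiebaR; the third element and the helper name words_with_tag are assumptions made for illustration.

```r
# Hand-built stand-in for tagger output: values are words, names are tags.
# The first two elements mirror the output above; the third is hypothetical.
tagged <- c(eng = "hello", eng = "world", x = "!")

# Return the words that carry a given tag, dropping the tag names.
words_with_tag <- function(tagged, tag) {
  unname(tagged[names(tagged) == tag])
}

eng_words  <- words_with_tag(tagged, "eng")
tag_counts <- table(names(tagged))   # number of words per tag

print(eng_words)
print(tag_counts)
```

The same indexing by names() should apply to the result of tagger[words], since it has the same shape.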

Keyword Extraction

The keyword extraction worker uses the MixSegment model to segment the text and the TF-IDF algorithm to find the keywords.

keys = worker("keywords", topn = 1)
keys <= "words of fun"
## 11.7392 
##   "fun"
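
To make the TF-IDF step concrete, here is a small base R sketch of ranking tokens by term count times inverse document frequency. It is not jiebaR's implementation: the function name tfidf_top, the IDF table, and the fallback IDF for unknown words are all invented for illustration, and it assumes raw counts as the term frequency.

```r
# Toy TF-IDF ranking over already-segmented tokens.
# The IDF values below are made up; they are not jiebaR's dictionary.
tfidf_top <- function(tokens, idf, default_idf, topn = 1) {
  tf    <- table(tokens)                    # raw count of each distinct token
  words <- names(tf)
  # Look up each word's IDF, falling back to default_idf for unknown words.
  w_idf <- ifelse(words %in% names(idf), idf[words], default_idf)
  score <- as.numeric(tf) * w_idf
  names(score) <- words
  head(sort(score, decreasing = TRUE), topn)
}

idf <- c(words = 6.0, of = 1.5)             # hypothetical IDF table
top <- tfidf_top(c("words", "of", "fun"), idf, default_idf = 10, topn = 1)
print(top)
```

In this toy setup the word missing from the IDF table receives the fallback value, so a rare or unknown word can outrank common ones.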

Simhash Distance

The simhash worker extracts keywords from an input and computes its simhash fingerprint. distance() does this for two inputs and then computes the Hamming distance between them.

words = "hello world"
simhasher = worker("simhash",topn=1)
simhasher[words]
## $simhash
## [1] "3804341492420753273"
## 
## $keyword
## 11.7392 
## "hello"
distance("hello world" , "hello world!" , simhasher)
## $distance
## [1] 0
## 
## $lhs
## 11.7392 
## "hello" 
## 
## $rhs
## 11.7392 
## "hello"
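
The general idea behind a simhash distance can be sketched in base R. This is a toy version under stated assumptions, not jiebaR's code: fingerprints are 8 bits instead of 64, the per-token bit patterns are hand-picked stand-ins for a real hash function, and all keyword weights are set to 1. The names simhash_bits and hamming are hypothetical.

```r
# Build a fingerprint: each token votes +weight for a 1 bit and -weight
# for a 0 bit; the fingerprint bit is 1 where the total vote is positive.
simhash_bits <- function(token_bits, weights) {
  # token_bits: matrix with one row per token and 0/1 entries;
  # weights has one entry per row, so row i is scaled by weights[i].
  signed <- (2 * token_bits - 1) * weights
  as.integer(colSums(signed) > 0)
}

# Hamming distance: number of positions where two fingerprints differ.
hamming <- function(a, b) sum(a != b)

# Hand-picked 8-bit patterns standing in for hashed keywords.
t1 <- c(1, 0, 1, 1, 0, 0, 1, 0)
t2 <- c(0, 1, 1, 0, 0, 1, 1, 0)
t3 <- c(1, 1, 0, 0, 1, 0, 1, 0)
t4 <- c(0, 0, 0, 1, 1, 1, 0, 1)

fp_a <- simhash_bits(rbind(t1, t2, t3), weights = c(1, 1, 1))
fp_b <- simhash_bits(rbind(t1, t2, t4), weights = c(1, 1, 1))

print(hamming(fp_a, fp_a))   # identical inputs differ in no bits
print(hamming(fp_a, fp_b))
```

Inputs that share most of their weighted keywords tend to produce fingerprints that differ in few bits, which is why the two near-identical strings above have a distance of 0.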

More Docs

See https://jiebaR.qinwf.com/

More Information and Issues

https://github.com/qinwf/jiebaR

https://github.com/aszxqw/cppjieba