Apportita is a partial R port from plasticityai/magnitude, which is a fast, simple utility library for handling vector embeddings.
Apportita would cover only partial features of the original Magnitude library. The main goal of this package is to enable access to user’s local magnitude data store. In other words, apportita would not support streaming access to remote sqlite files.
The package mainly targets the range where the magnitude-light library does, thus some functionalities would not be implemented.
Usage
Construct a Magnitude connection
library(apportita)
## apportita includes a sample magnitude file
## trained with `movie_review` dataset from 'text2vec' package.
conn <- apportita::magnitude(system.file("magnitude/w2v_en_sample.magnitude", package = "apportita"))
dim(conn) ## check the dimension of Magnitude table
#> [1] 12711 20
Remember to close the connection after querying to the Magnitude table.
close(conn)
Querying
apportita::query(conn, c("i", "watch", "a", "movie"))
#> # A tibble: 4 × 21
#> key dim_0 dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 a 0.450 -0.183 0.197 -0.0840 -0.117 -0.134 -0.0380 0.200 0.232 -0.248
#> 2 i 0.453 -0.102 0.152 0.0377 -0.0414 -0.121 -0.181 0.115 0.0691 -0.316
#> 3 movie 0.277 0.0376 -0.144 -0.391 -0.135 -0.111 0.100 0.242 0.295 -0.147
#> 4 watch 0.327 0.172 -0.266 -0.0383 -0.190 -0.233 -0.343 0.206 0.127 -0.173
#> # … with 10 more variables: dim_10 <dbl>, dim_11 <dbl>, dim_12 <dbl>,
#> # dim_13 <dbl>, dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>,
#> # dim_18 <dbl>, dim_19 <dbl>
apportita::doesnt_match(conn, "book", c("i", "love", "movie"), n = 3)
#> # A tibble: 3 × 2
#> keys distance
#> <chr> <dbl>
#> 1 i 0.948
#> 2 love 0.746
#> 3 movie 0.609
apportita::most_similar(conn, "book", c("i", "love", "movie"), n = 3)
#> # A tibble: 3 × 2
#> keys similarity
#> <chr> <dbl>
#> 1 movie 0.815
#> 2 love 0.722
#> 3 i 0.551
Calculate distance/similarity from words to words
Apportita supports methods provided in the proxyC package for calculating distances and similarities.
apportita::calc_dist(conn, "book", c("i", "love", "movie"))
#> 1 x 3 sparse Matrix of class "dgTMatrix"
#> i love movie
#> book 0.9480003 0.7461417 0.6085849
apportita::calc_simil(conn, "book", c("i", "love", "movie"))
#> 1 x 3 sparse Matrix of class "dgTMatrix"
#> i love movie
#> book 0.5506478 0.7216363 0.8148123
Experimentally, apportita::calc_wrd
also supports computing the Word Rotator’s Distance, which is a textual similarity measure based on optimal transport, presented in https://arxiv.org/abs/2004.15003.
Slicing samples
apportita::slice_n(conn, n = 2, offset = 5)
#> # A tibble: 2 × 21
#> key dim_0 dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 marsh -0.0276 0.257 0.223 -0.172 0.0647 -0.158 0.0134 0.156 0.203 -0.311
#> 2 jonny -0.0681 -0.156 0.0627 0.0813 0.379 -0.209 0.116 -0.0220 0.251 -0.224
#> # … with 10 more variables: dim_10 <dbl>, dim_11 <dbl>, dim_12 <dbl>,
#> # dim_13 <dbl>, dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>,
#> # dim_18 <dbl>, dim_19 <dbl>
apportita::slice_index(conn, index = c(20, 100, 600))
#> # A tibble: 3 × 21
#> key dim_0 dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 yawni… -0.185 0.354 0.120 0.0251 -0.269 0.278 0.130 0.345 0.277 0.0802
#> 2 dane -0.245 -0.329 0.353 0.382 -0.288 0.0833 0.0476 -0.254 0.186 0.385
#> 3 drenc… 0.186 -0.121 -0.00184 -0.343 -0.268 0.235 0.0297 0.333 0.286 0.259
#> # … with 10 more variables: dim_10 <dbl>, dim_11 <dbl>, dim_12 <dbl>,
#> # dim_13 <dbl>, dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>,
#> # dim_18 <dbl>, dim_19 <dbl>
apportita::slice_frac(conn, frac = .01)
#> # A tibble: 127 × 21
#> key dim_0 dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 framing -0.217 -0.219 -0.222 0.163 -0.283 -0.0904 -0.0736 -0.363
#> 2 amusement -0.0711 -0.0767 0.0619 0.351 -0.277 -0.118 -0.169 -0.0786
#> 3 does 0.339 0.0933 -0.143 0.131 0.345 -0.0985 0.0791 0.167
#> 4 devilish 0.376 0.261 -0.0357 0.00566 -0.169 -0.214 0.232 -0.305
#> 5 eve -0.0545 -0.179 0.214 0.246 0.357 0.305 0.280 -0.0350
#> 6 robbins 0.276 0.323 -0.251 -0.171 -0.140 -0.178 0.00751 -0.161
#> 7 hurts 0.0422 0.415 -0.0246 0.116 0.205 -0.325 0.279 0.140
#> 8 exists -0.285 -0.286 -0.319 0.180 0.455 0.132 0.0227 -0.387
#> 9 concert -0.105 0.336 -0.0863 0.196 -0.145 -0.285 0.155 -0.0114
#> 10 kennedy 0.192 0.123 -0.129 -0.314 0.0686 -0.0938 -0.0158 0.209
#> # … with 117 more rows, and 12 more variables: dim_8 <dbl>, dim_9 <dbl>,
#> # dim_10 <dbl>, dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>, dim_14 <dbl>,
#> # dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>, dim_19 <dbl>