symbolize-1.0.3.1: Efficient global Symbol table, with Garbage Collection.
Safe HaskellSafe-Inferred
LanguageGHC2021

Symbolize

Description

Implementation of a global Symbol Table, with garbage collection.

Symbols, also known as Atoms or Interned Strings, are a common technique to reduce memory usage and improve performance when using many small strings:

A Symbol represents a string (any Textual, so String, Text, ShortText, ByteString, ShortByteString, etc.)

Just like ShortText, ShortByteString and ByteArray, a Symbol has an optimized memory representation, directly wrapping a primitive ByteArray#.

Furthermore, a global symbol table keeps track of which values currently exist, ensuring we always deduplicate symbols. This therefore allows us to:

  • Check for equality between symbols in constant-time (using pointer equality)
  • Calculate the hash in constant-time (using StableName)
  • Keep the memory footprint of repeatedly-seen strings low.

This is very useful if you're frequently comparing strings and the same strings might come up many times. It also makes Symbol a great candidate for a key in e.g. a HashMap or HashSet.

The global symbol table is implemented using weak pointers, which means that unused symbols will be garbage collected. As such, you do not need to be concerned about memory leaks (as is the case with many other symbol table implementations).

Symbols are considered 'the same' regardless of whether they originate from a String, (lazy or strict, normal or short) Text, (lazy or strict, normal or short) ByteString etc.

The main advantages of Symbolize over other symbol table implementations are:

  • Garbage collection: Symbols which are no longer used are automatically cleaned up.
  • Support for any Textual type, including String, (strict and lazy) Text, (strict and lazy) ByteString, ShortText, ShortByteString, etc.
  • Great memory usage:
  • Symbols are simply a (lifted) wrapper around a ByteArray#, which is nicely unpacked by GHC.
  • The symbol table is an IntMap that contains weak pointers to these same ByteArray#s and their associated StableName#s
  • Great performance:
  • unintern is a simple pointer-dereference
  • calls to lookup are free of atomic memory barriers (and never have to wait on a concurrent thread running intern)
  • Thread-safe

Basic usage

This module is intended to be imported qualified, e.g.

import Symbolize (Symbol)
import qualified Symbolize

To intern a string, use intern:

>>> hello = Symbolize.intern "hello"
>>> world = Symbolize.intern "world"
>>> (hello, world)
(Symbolize.intern "hello",Symbolize.intern "world")

Interning supports any Textual type, so you can also use Text or ByteString etc.:

>>> import Data.Text (Text)
>>> niceCheeses = fmap Symbolize.intern (["Roquefort", "Camembert", "Brie"] :: [Text])
>>> niceCheeses
[Symbolize.intern "Roquefort",Symbolize.intern "Camembert",Symbolize.intern "Brie"]

And if you are using OverloadedStrings, you can use the IsString instance to intern constants:

>>> hello2 = ("hello" :: Symbol)
>>> hello2
Symbolize.intern "hello"
>>> Symbolize.intern ("world" :: Text)
Symbolize.intern "world"

Comparisons between symbols run in O(1) time:

>>> hello == hello2
True
>>> hello == world
False

To get back the textual value of a symbol, use unintern:

>>> Symbolize.unintern hello
"hello"

If you want to check whether a string is currently interned, use lookup:

>>> Symbolize.lookup "hello"
Just (Symbolize.intern "hello")

Symbols make great keys for HashMap and HashSet. Hashing them takes constant-time and they are guaranteed to be unique:

>>> Data.Hashable.hash hello
1
>>> Data.Hashable.hash world
2
>>> fmap Data.Hashable.hash niceCheeses
[3,4,5]

For introspection, you can look at how many symbols currently exist:

>>> System.Mem.performGC
>>> Symbolize.globalSymbolTableSize
5
>>> [unintern (intern (show x)) | x <- [1..5]]
["1","2","3","4","5"]
>>> Symbolize.globalSymbolTableSize
10

Unused symbols will be garbage-collected, so you don't have to worry about memory leaks:

>>> System.Mem.performGC
>>> Symbolize.globalSymbolTableSize
5

For deeper introspection, you can look at the Show instance of the global symbol table: (Note that the exact format is subject to change.)

>>> Symbolize.globalSymbolTable
GlobalSymbolTable { size = 5, symbols = ["Brie","Camembert","Roquefort","hello","world"] }
Synopsis

Symbol

data Symbol where Source #

A string-like type with O(1) equality and comparison.

A Symbol represents a string (any Textual, so String, Text, ShortText, ByteString, ShortByteString, etc.)

Just like ShortText, ShortByteString and ByteArray, a Symbol has an optimized memory representation, directly wrapping a primitive ByteArray#.

Furthermore, a global symbol table keeps track of which values currently exist, ensuring we always deduplicate symbols. This therefore allows us to: - Check for equality between symbols in constant-time (using pointer equality) - Calculate the hash in constant-time (using StableName) - Keep the memory footprint of repeatedly-seen strings low.

This is very useful if you're frequently comparing strings and the same strings might come up many times. It also makes Symbol a great candidate for a key in e.g. a HashMap or HashSet.

The global symbol table is implemented using weak pointers, which means that unused symbols will be garbage collected. As such, you do not need to be concerned about memory leaks (as is the case with many other symbol table implementations).

Symbols are considered 'the same' regardless of whether they originate from a String, (lazy or strict, normal or short) Text, (lazy or strict, normal or short) ByteString etc.

Constructors

Symbol :: Symbol# -> Symbol 

Instances

Instances details
Data Symbol Source #

This Data instance follows the same implementation as the one for Text and ShortText: It pretends Symbol is a a collection holding a `[Char]`.

Since: 1.0.1.0

Instance details

Defined in Symbolize

Methods

gfoldl :: (forall d b. Data d => c (d -> b) -> d -> c b) -> (forall g. g -> c g) -> Symbol -> c Symbol #

gunfold :: (forall b r. Data b => c (b -> r) -> c r) -> (forall r. r -> c r) -> Constr -> c Symbol #

toConstr :: Symbol -> Constr #

dataTypeOf :: Symbol -> DataType #

dataCast1 :: Typeable t => (forall d. Data d => c (t d)) -> Maybe (c Symbol) #

dataCast2 :: Typeable t => (forall d e. (Data d, Data e) => c (t d e)) -> Maybe (c Symbol) #

gmapT :: (forall b. Data b => b -> b) -> Symbol -> Symbol #

gmapQl :: (r -> r' -> r) -> r -> (forall d. Data d => d -> r') -> Symbol -> r #

gmapQr :: forall r r'. (r' -> r -> r) -> r -> (forall d. Data d => d -> r') -> Symbol -> r #

gmapQ :: (forall d. Data d => d -> u) -> Symbol -> [u] #

gmapQi :: Int -> (forall d. Data d => d -> u) -> Symbol -> u #

gmapM :: Monad m => (forall d. Data d => d -> m d) -> Symbol -> m Symbol #

gmapMp :: MonadPlus m => (forall d. Data d => d -> m d) -> Symbol -> m Symbol #

gmapMo :: MonadPlus m => (forall d. Data d => d -> m d) -> Symbol -> m Symbol #

IsString Symbol Source # 
Instance details

Defined in Symbolize

Methods

fromString :: String -> Symbol #

Read Symbol Source #

To be a good citizen w.r.t both Show and IsString, reading is supported two ways:

>>> read @Symbol "Symbolize.intern \"Haskell\""
Symbolize.intern "Haskell"
>>> read @Symbol "\"Curry\""
Symbolize.intern "Curry"
Instance details

Defined in Symbolize

Show Symbol Source # 
Instance details

Defined in Symbolize

Binary Symbol Source #

Uses the ShortByteString instance of Textual under the hood; invalid UTF-8 is replaced by the Unicode replacement character.

Since: 1.0.1.0

Instance details

Defined in Symbolize

Methods

put :: Symbol -> Put #

get :: Get Symbol #

putList :: [Symbol] -> Put #

NFData Symbol Source #

The contents inside a Symbol are always guaranteed to be evaluated, so this only forces the outermost constructor using seq.

Instance details

Defined in Symbolize

Methods

rnf :: Symbol -> () #

Eq Symbol Source #

Equality checking takes only O(1) time, and is a simple pointer-equality check.

Instance details

Defined in Symbolize

Methods

(==) :: Symbol -> Symbol -> Bool #

(/=) :: Symbol -> Symbol -> Bool #

Ord Symbol Source #

Symbols are ordered by their lexicographical UTF-8 representation.

Therefore, comparison takes O(n) time.

Instance details

Defined in Symbolize

Hashable Symbol Source #

Hashing a Symbol is very fast:

hash takes O(1) and results in zero collisions, as StableNames are used.

hashWithSalt takes O(1) time; just as long as hashWithSalt-ing any other Int.

Instance details

Defined in Symbolize

Methods

hashWithSalt :: Int -> Symbol -> Int #

hash :: Symbol -> Int #

intern :: Textual str => str -> Symbol Source #

Intern a string-like value.

First converts the string to a ShortText (if it isn't already one). See Textual for the type-specific time complexity of this.

Finally, it takes O(min(n, 64)) time to try to look up a matching symbol and insert it if it did not exist yet (where n is the number of symbols currently in the table).

Any concurrent calls to (the critical section in) intern are synchronized.

internUnsafe :: Textual str => str -> Symbol Source #

Like intern, but skips any potential UTF-8 validation

unintern :: Textual str => Symbol -> str Source #

Unintern a symbol, returning its textual value.

Looking up the Symbol's textual value takes O(1) time, as we simply follow its internal pointer.

Afterwards, the textual value is converted to the desired string type. See Textual for the type-specific time complexity of this.

Does not use the symbol table, so runs fully concurrently with any other functions manipulating it.

lookup :: (Textual str, MonadIO m) => str -> m (Maybe Symbol) Source #

Looks up a symbol in the global symbol table.

Returns Nothing if no such symbol currently exists.

First converts the string to a ShortText (if it isn't already one). See Textual for the type-specific time complexity of this.

Then, takes O(min(n, 64)) time, where n is the number of symbols currently in the table.

Runs concurrently with any other operation on the symbol table, without any atomic memory barriers.

Because the result can vary depending on the current state of the symbol table, this function is not pure.

Introspection & Metrics

data GlobalSymbolTable Source #

The global Symbol Table, containing a mapping between each symbol's textual representation and its deduplicated pointer.

You cannot manipulate the table itself directly, but you can use globalSymbolTable to get a handle to it and use its Show instance for introspection.

globalSymbolTableSize can similarly be used to get the current size of the table.

Current implementation details (these might change even between PVP-compatible versions):

  • A `HashTable.Dictionary Int` (from the `vector-hashtables` library) is used for mapping $(SipHash text) -> weak symbol$.
  • Since SipHash is used as hashing algorithm and the key that is used is randomized on global table initialization, the table is resistent to HashDoS attacks.

Instances

Instances details
Show GlobalSymbolTable Source #

What exactly this Show instance prints might change between PVP-compatible versions

Instance details

Defined in Symbolize.SymbolTable

globalSymbolTable :: MonadIO m => m GlobalSymbolTable Source #

Get a handle to the GlobalSymbolTable

This can be used for pretty-printing using its Show instance

globalSymbolTableSize :: IO Word Source #

Returns the current size of the global symbol table. Useful for introspection or metrics.

Should not be used in high-performance code, as it might walk over the full table.

manipulate unlifted Symbols directly

data Symbol# Source #

Unlifted version of Symbol

This is of kind UnliftedType (AKA `TYPE (BoxedRep Unlifted)`) which means GHC knows it is already fully-evaluated and can never contain bottoms.

In many cases, GHC is able to figure out that the symbols you're using are already-evaluated and will unbox them into Symbol#s automatically behind the scenes.

However, in some cases, directly manipulating a Symbol# can be beneficial, such as when storing them as keys or values inside a collection that supports UnliftedTypes.

A Symbol# has exactly the same in-memory representation as a ByteArray# (It is an unlifted newtype around ByteArray#).

intern# :: Textual str => str -> Symbol# Source #

Version of intern that returns an unlifted Symbol#

internUnsafe# :: Textual str => str -> Symbol# Source #

Like intern#, but skips any potential UTF-8 validation

intern## :: ByteArray# -> Symbol# Source #

Deprecated: Renamed to internUnsafe##

unintern# :: Textual str => Symbol# -> str Source #

Version of unintern that works directly on an unlifted Symbol#

unintern## :: Symbol# -> ByteArray# Source #

Version of unintern that works directly on an unlifted Symbol# and returns the internal `ByteArray#

sameSymbol# :: Symbol# -> Symbol# -> Bool Source #

Equality checking on unlifted symbols.

This takes only O(1) time, and is a simple pointer-equality check.

sameSymbol## :: Symbol# -> Symbol# -> Int# Source #

Version of sameSymbol# that returns the an Int#, an unlifted Bool

hashSymbol# :: Symbol# -> Int Source #

Hash an unlifted Symbol#

Takes O(1) and results in zero collisions, as StableNames are used.

hashSymbol## :: Symbol# -> Int# Source #

Hash an unlifted Symbol#, returning an unlifted Int#

Takes O(1) and results in zero collisions, as StableNames are used.

compareSymbol# :: Symbol# -> Symbol# -> Ordering Source #

Compare two unlifted symbols

The symbols are compared lexicographically using their UTF-8 representation, so this takes linear time.