Safe Haskell | Safe-Inferred |
---|---|
Language | GHC2021 |
Symbolize
Description
Implementation of a global Symbol Table, with garbage collection.
Symbols, also known as Atoms or Interned Strings, are a common technique to reduce memory usage and improve performance when using many small strings:
A Symbol represents a string (any Textual
, so String, Text, ShortText, ByteString, ShortByteString, etc.)
Just like ShortText
, ShortByteString
and ByteArray
, a Symbol
has an optimized memory representation,
directly wrapping a primitive ByteArray#
.
Furthermore, a global symbol table keeps track of which values currently exist, ensuring we always deduplicate symbols. This therefore allows us to:
- Check for equality between symbols in constant-time (using pointer equality)
- Calculate the hash in constant-time (using
StableName
) - Keep the memory footprint of repeatedly-seen strings low.
This is very useful if you're frequently comparing strings
and the same strings might come up many times.
It also makes Symbol a great candidate for a key in e.g. a HashMap
or HashSet
.
The global symbol table is implemented using weak pointers, which means that unused symbols will be garbage collected. As such, you do not need to be concerned about memory leaks (as is the case with many other symbol table implementations).
Symbols are considered 'the same' regardless of whether they originate
from a String
, (lazy or strict, normal or short) Text
, (lazy or strict, normal or short) ByteString
etc.
The main advantages of Symbolize over other symbol table implementations are:
- Garbage collection: Symbols which are no longer used are automatically cleaned up.
- Support for any
Textual
type, includingString
, (strict and lazy)Text
, (strict and lazy)ByteString
,ShortText
,ShortByteString
, etc. - Great memory usage:
Symbol
s are simply a (lifted) wrapper around aByteArray#
, which is nicely unpacked by GHC.- The symbol table is an
IntMap
that contains weak pointers to these sameByteArray#
s and their associatedStableName#
s - Great performance:
unintern
is a simple pointer-dereference- calls to
lookup
are free of atomic memory barriers (and never have to wait on a concurrent thread runningintern
) - Thread-safe
Basic usage
This module is intended to be imported qualified, e.g.
import Symbolize (Symbol) import qualified Symbolize
To intern a string, use intern
:
>>>
hello = Symbolize.intern "hello"
>>>
world = Symbolize.intern "world"
>>>
(hello, world)
(Symbolize.intern "hello",Symbolize.intern "world")
Interning supports any Textual
type, so you can also use Text
or ByteString
etc.:
>>>
import Data.Text (Text)
>>>
niceCheeses = fmap Symbolize.intern (["Roquefort", "Camembert", "Brie"] :: [Text])
>>>
niceCheeses
[Symbolize.intern "Roquefort",Symbolize.intern "Camembert",Symbolize.intern "Brie"]
And if you are using OverloadedStrings
, you can use the IsString
instance to intern constants:
>>>
hello2 = ("hello" :: Symbol)
>>>
hello2
Symbolize.intern "hello">>>
Symbolize.intern ("world" :: Text)
Symbolize.intern "world"
Comparisons between symbols run in O(1) time:
>>>
hello == hello2
True>>>
hello == world
False
To get back the textual value of a symbol, use unintern
:
>>>
Symbolize.unintern hello
"hello"
If you want to check whether a string is currently interned, use lookup
:
>>>
Symbolize.lookup "hello"
Just (Symbolize.intern "hello")
Symbols make great keys for HashMap
and HashSet
.
Hashing them takes constant-time and they are guaranteed to be unique:
>>>
Data.Hashable.hash hello
1>>>
Data.Hashable.hash world
2>>>
fmap Data.Hashable.hash niceCheeses
[3,4,5]
For introspection, you can look at how many symbols currently exist:
>>>
System.Mem.performGC
>>>
Symbolize.globalSymbolTableSize
5>>>
[unintern (intern (show x)) | x <- [1..5]]
["1","2","3","4","5"]>>>
Symbolize.globalSymbolTableSize
10
Unused symbols will be garbage-collected, so you don't have to worry about memory leaks:
>>>
System.Mem.performGC
>>>
Symbolize.globalSymbolTableSize
5
For deeper introspection, you can look at the Show instance of the global symbol table: (Note that the exact format is subject to change.)
>>>
Symbolize.globalSymbolTable
GlobalSymbolTable { size = 5, symbols = ["Brie","Camembert","Roquefort","hello","world"] }
Synopsis
- data Symbol where
- intern :: Textual str => str -> Symbol
- internUnsafe :: Textual str => str -> Symbol
- unintern :: Textual str => Symbol -> str
- lookup :: (Textual str, MonadIO m) => str -> m (Maybe Symbol)
- data GlobalSymbolTable
- globalSymbolTable :: MonadIO m => m GlobalSymbolTable
- globalSymbolTableSize :: IO Word
- data Symbol#
- intern# :: Textual str => str -> Symbol#
- internUnsafe# :: Textual str => str -> Symbol#
- intern## :: ByteArray# -> Symbol#
- unintern# :: Textual str => Symbol# -> str
- unintern## :: Symbol# -> ByteArray#
- sameSymbol# :: Symbol# -> Symbol# -> Bool
- sameSymbol## :: Symbol# -> Symbol# -> Int#
- hashSymbol# :: Symbol# -> Int
- hashSymbol## :: Symbol# -> Int#
- compareSymbol# :: Symbol# -> Symbol# -> Ordering
Symbol
A string-like type with O(1) equality and comparison.
A Symbol represents a string (any Textual
, so String, Text, ShortText, ByteString, ShortByteString, etc.)
Just like ShortText
, ShortByteString
and ByteArray
, a Symbol
has an optimized memory representation,
directly wrapping a primitive ByteArray#
.
Furthermore, a global symbol table keeps track of which values currently exist, ensuring we always deduplicate symbols.
This therefore allows us to:
- Check for equality between symbols in constant-time (using pointer equality)
- Calculate the hash in constant-time (using StableName
)
- Keep the memory footprint of repeatedly-seen strings low.
This is very useful if you're frequently comparing strings
and the same strings might come up many times.
It also makes Symbol a great candidate for a key in e.g. a HashMap
or HashSet
.
The global symbol table is implemented using weak pointers, which means that unused symbols will be garbage collected. As such, you do not need to be concerned about memory leaks (as is the case with many other symbol table implementations).
Symbols are considered 'the same' regardless of whether they originate
from a String
, (lazy or strict, normal or short) Text
, (lazy or strict, normal or short) ByteString
etc.
Instances
Data Symbol Source # | This Since: 1.0.1.0 |
Defined in Symbolize Methods gfoldl :: (forall d b. Data d => c (d -> b) -> d -> c b) -> (forall g. g -> c g) -> Symbol -> c Symbol # gunfold :: (forall b r. Data b => c (b -> r) -> c r) -> (forall r. r -> c r) -> Constr -> c Symbol # toConstr :: Symbol -> Constr # dataTypeOf :: Symbol -> DataType # dataCast1 :: Typeable t => (forall d. Data d => c (t d)) -> Maybe (c Symbol) # dataCast2 :: Typeable t => (forall d e. (Data d, Data e) => c (t d e)) -> Maybe (c Symbol) # gmapT :: (forall b. Data b => b -> b) -> Symbol -> Symbol # gmapQl :: (r -> r' -> r) -> r -> (forall d. Data d => d -> r') -> Symbol -> r # gmapQr :: forall r r'. (r' -> r -> r) -> r -> (forall d. Data d => d -> r') -> Symbol -> r # gmapQ :: (forall d. Data d => d -> u) -> Symbol -> [u] # gmapQi :: Int -> (forall d. Data d => d -> u) -> Symbol -> u # gmapM :: Monad m => (forall d. Data d => d -> m d) -> Symbol -> m Symbol # gmapMp :: MonadPlus m => (forall d. Data d => d -> m d) -> Symbol -> m Symbol # gmapMo :: MonadPlus m => (forall d. Data d => d -> m d) -> Symbol -> m Symbol # | |
IsString Symbol Source # | |
Defined in Symbolize Methods fromString :: String -> Symbol # | |
Read Symbol Source # | To be a good citizen w.r.t both
|
Show Symbol Source # | |
Binary Symbol Source # | Uses the Since: 1.0.1.0 |
NFData Symbol Source # | The contents inside a |
Eq Symbol Source # | Equality checking takes only O(1) time, and is a simple pointer-equality check. |
Ord Symbol Source # | Symbols are ordered by their lexicographical UTF-8 representation. Therefore, comparison takes O(n) time. |
Hashable Symbol Source # | Hashing a
|
intern :: Textual str => str -> Symbol Source #
Intern a string-like value.
First converts the string to a ShortText
(if it isn't already one).
See Textual
for the type-specific time complexity of this.
Finally, it takes O(min(n, 64)) time to try to look up a matching symbol and insert it if it did not exist yet (where n is the number of symbols currently in the table).
Any concurrent calls to (the critical section in) intern
are synchronized.
internUnsafe :: Textual str => str -> Symbol Source #
Like intern
, but skips any potential UTF-8 validation
unintern :: Textual str => Symbol -> str Source #
Unintern a symbol, returning its textual value.
Looking up the Symbol's textual value takes O(1) time, as we simply follow its internal pointer.
Afterwards, the textual value is converted to the desired string type.
See Textual
for the type-specific time complexity of this.
Does not use the symbol table, so runs fully concurrently with any other functions manipulating it.
lookup :: (Textual str, MonadIO m) => str -> m (Maybe Symbol) Source #
Looks up a symbol in the global symbol table.
Returns Nothing
if no such symbol currently exists.
First converts the string to a ShortText
(if it isn't already one).
See Textual
for the type-specific time complexity of this.
Then, takes O(min(n, 64)) time, where n is the number of symbols currently in the table.
Runs concurrently with any other operation on the symbol table, without any atomic memory barriers.
Because the result can vary depending on the current state of the symbol table, this function is not pure.
Introspection & Metrics
data GlobalSymbolTable Source #
The global Symbol Table, containing a mapping between each symbol's textual representation and its deduplicated pointer.
You cannot manipulate the table itself directly,
but you can use globalSymbolTable
to get a handle to it and use its Show
instance for introspection.
globalSymbolTableSize
can similarly be used to get the current size of the table.
Current implementation details (these might change even between PVP-compatible versions):
- A `HashTable.Dictionary Int` (from the `vector-hashtables` library) is used for mapping $(SipHash text) -> weak symbol$.
- Since SipHash is used as hashing algorithm and the key that is used is randomized on global table initialization, the table is resistent to HashDoS attacks.
Instances
Show GlobalSymbolTable Source # | What exactly this |
Defined in Symbolize.SymbolTable Methods showsPrec :: Int -> GlobalSymbolTable -> ShowS # show :: GlobalSymbolTable -> String # showList :: [GlobalSymbolTable] -> ShowS # |
globalSymbolTable :: MonadIO m => m GlobalSymbolTable Source #
Get a handle to the GlobalSymbolTable
This can be used for pretty-printing using its Show
instance
globalSymbolTableSize :: IO Word Source #
Returns the current size of the global symbol table. Useful for introspection or metrics.
Should not be used in high-performance code, as it might walk over the full table.
manipulate unlifted Symbols directly
Unlifted version of Symbol
This is of kind UnliftedType
(AKA `TYPE (BoxedRep Unlifted)`)
which means GHC knows it is already fully-evaluated
and can never contain bottoms.
In many cases, GHC is able to figure out
that the symbols you're using are already-evaluated
and will unbox them into Symbol#
s automatically
behind the scenes.
However, in some cases, directly manipulating a Symbol#
can be beneficial, such as when storing them as keys or values
inside a collection that supports UnliftedType
s.
A Symbol#
has exactly the same in-memory representation
as a ByteArray#
(It is an unlifted newtype around ByteArray#
).
internUnsafe# :: Textual str => str -> Symbol# Source #
Like intern#
, but skips any potential UTF-8 validation
intern## :: ByteArray# -> Symbol# Source #
Deprecated: Renamed to internUnsafe##
unintern## :: Symbol# -> ByteArray# Source #
sameSymbol# :: Symbol# -> Symbol# -> Bool Source #
Equality checking on unlifted symbols.
This takes only O(1) time, and is a simple pointer-equality check.
sameSymbol## :: Symbol# -> Symbol# -> Int# Source #
Version of sameSymbol#
that returns the an Int#
, an unlifted Bool
hashSymbol# :: Symbol# -> Int Source #
Hash an unlifted Symbol#
Takes O(1) and results in zero collisions, as StableName
s are used.
hashSymbol## :: Symbol# -> Int# Source #