[Git][ghc/ghc][master] ucd2haskell: remove Streamly dependency + misc
Marge Bot (@marge-bot)
gitlab at gitlab.haskell.org
Thu Jun 13 19:36:16 UTC 2024
Marge Bot pushed to branch master at Glasgow Haskell Compiler / GHC
Commits:
7b23ce8b by Pierre Le Marre at 2024-06-13T15:35:04-04:00
ucd2haskell: remove Streamly dependency + misc
- Remove dead code.
- Remove `streamly` dependency.
- Process files with `bytestring`.
- Replace Unicode files parsers with the corresponding ones from the
package `unicode-data-parser`.
- Simplify cabal file and rename module
- Regenerate `ghc-internal` Unicode files with new header
- - - - -
10 changed files:
- libraries/ghc-internal/src/GHC/Internal/Unicode/Char/DerivedCoreProperties.hs
- libraries/ghc-internal/src/GHC/Internal/Unicode/Char/UnicodeData/GeneralCategory.hs
- libraries/ghc-internal/src/GHC/Internal/Unicode/Char/UnicodeData/SimpleLowerCaseMapping.hs
- libraries/ghc-internal/src/GHC/Internal/Unicode/Char/UnicodeData/SimpleTitleCaseMapping.hs
- libraries/ghc-internal/src/GHC/Internal/Unicode/Char/UnicodeData/SimpleUpperCaseMapping.hs
- libraries/ghc-internal/src/GHC/Internal/Unicode/Version.hs
- − libraries/ghc-internal/tools/ucd2haskell/exe/Parser/Text.hs
- libraries/ghc-internal/tools/ucd2haskell/exe/UCD2Haskell.hs
- + libraries/ghc-internal/tools/ucd2haskell/exe/UCD2Haskell/ModuleGenerators.hs
- libraries/ghc-internal/tools/ucd2haskell/ucd2haskell.cabal
Changes:
=====================================
libraries/ghc-internal/src/GHC/Internal/Unicode/Char/DerivedCoreProperties.hs
=====================================
@@ -8,9 +8,8 @@
-----------------------------------------------------------------------------
-- |
-- Module : GHC.Internal.Unicode.Char.DerivedCoreProperties
--- Copyright : (c) 2020 Composewell Technologies and Contributors
-- License : BSD-3-Clause
--- Maintainer : streamly at composewell.com
+-- Maintainer : The GHC Developers <ghc-devs at haskell.org>
-- Stability : internal
-----------------------------------------------------------------------------
=====================================
libraries/ghc-internal/src/GHC/Internal/Unicode/Char/UnicodeData/GeneralCategory.hs
=====================================
@@ -8,9 +8,8 @@
-----------------------------------------------------------------------------
-- |
-- Module : GHC.Internal.Unicode.Char.UnicodeData.GeneralCategory
--- Copyright : (c) 2020 Composewell Technologies and Contributors
-- License : BSD-3-Clause
--- Maintainer : streamly at composewell.com
+-- Maintainer : The GHC Developers <ghc-devs at haskell.org>
-- Stability : internal
-----------------------------------------------------------------------------
=====================================
libraries/ghc-internal/src/GHC/Internal/Unicode/Char/UnicodeData/SimpleLowerCaseMapping.hs
=====================================
@@ -7,9 +7,8 @@
-----------------------------------------------------------------------------
-- |
-- Module : GHC.Internal.Unicode.Char.UnicodeData.SimpleLowerCaseMapping
--- Copyright : (c) 2020 Composewell Technologies and Contributors
-- License : BSD-3-Clause
--- Maintainer : streamly at composewell.com
+-- Maintainer : The GHC Developers <ghc-devs at haskell.org>
-- Stability : internal
-----------------------------------------------------------------------------
=====================================
libraries/ghc-internal/src/GHC/Internal/Unicode/Char/UnicodeData/SimpleTitleCaseMapping.hs
=====================================
@@ -7,9 +7,8 @@
-----------------------------------------------------------------------------
-- |
-- Module : GHC.Internal.Unicode.Char.UnicodeData.SimpleTitleCaseMapping
--- Copyright : (c) 2020 Composewell Technologies and Contributors
-- License : BSD-3-Clause
--- Maintainer : streamly at composewell.com
+-- Maintainer : The GHC Developers <ghc-devs at haskell.org>
-- Stability : internal
-----------------------------------------------------------------------------
=====================================
libraries/ghc-internal/src/GHC/Internal/Unicode/Char/UnicodeData/SimpleUpperCaseMapping.hs
=====================================
@@ -7,9 +7,8 @@
-----------------------------------------------------------------------------
-- |
-- Module : GHC.Internal.Unicode.Char.UnicodeData.SimpleUpperCaseMapping
--- Copyright : (c) 2020 Composewell Technologies and Contributors
-- License : BSD-3-Clause
--- Maintainer : streamly at composewell.com
+-- Maintainer : The GHC Developers <ghc-devs at haskell.org>
-- Stability : internal
-----------------------------------------------------------------------------
=====================================
libraries/ghc-internal/src/GHC/Internal/Unicode/Version.hs
=====================================
@@ -6,9 +6,8 @@
-----------------------------------------------------------------------------
-- |
-- Module : GHC.Internal.Unicode.Version
--- Copyright : (c) 2020 Composewell Technologies and Contributors
-- License : BSD-3-Clause
--- Maintainer : streamly at composewell.com
+-- Maintainer : The GHC Developers <ghc-devs at haskell.org>
-- Stability : internal
-----------------------------------------------------------------------------
=====================================
libraries/ghc-internal/tools/ucd2haskell/exe/Parser/Text.hs deleted
=====================================
@@ -1,1127 +0,0 @@
-{-# LANGUAGE LambdaCase #-}
-
--- |
--- Module : Parser.Text
--- Copyright : (c) 2020 Composewell Technologies and Contributors
--- (c) 2016-2017 Harendra Kumar
--- (c) 2014-2015 Antonio Nikishaev
--- License : BSD-3-Clause
--- Maintainer : streamly at composewell.com
--- Stability : internal
-
--- This code was taken from https://github.com/composewell/unicode-data.
--- The original Unicode database parser was taken from
--- https://github.com/composewell/unicode-transforms but was completely
--- rewritten from scratch to parse from UCD text files instead of XML, only
--- some types remain the same. That code in turn was originally taken from
--- https://github.com/llelf/prose (Antonio Nikishaev) and heavily modified by
--- Harendra Kumar.
---
-module Parser.Text (genModules) where
-
-import Control.Exception (catch, IOException)
-import Control.Monad (void)
-import Data.Bits (Bits(..))
-import Data.Word (Word8)
-import Data.Char (chr, ord, isSpace)
-import Data.Functor ((<&>))
-import Data.Function ((&))
-import Data.List (intersperse, unfoldr)
-import Data.List.Split (splitWhen)
-import Numeric (showHex)
-import Streamly.Data.Fold (Fold)
-import System.Directory (createDirectoryIfMissing)
-import System.Environment (getEnv)
-import System.FilePath ((</>), (<.>))
-
--- import qualified Data.Set as Set
-import Streamly.Data.Stream (Stream)
-import qualified Streamly.Data.Stream.Prelude as Stream
-import qualified Streamly.Data.Fold as Fold
-import qualified Streamly.Internal.Data.Fold as Fold
-import qualified Streamly.Data.Unfold as Unfold
-import qualified Streamly.FileSystem.Handle as Handle
-import qualified Streamly.Unicode.Stream as Unicode
-import qualified Streamly.Internal.Unicode.Stream as Unicode
-import qualified System.IO as Sys
-
-import Prelude hiding (pred)
-
--------------------------------------------------------------------------------
--- Types
--------------------------------------------------------------------------------
-
-data GeneralCategory =
- Lu|Ll|Lt| --LC
- Lm|Lo| --L
- Mn|Mc|Me| --M
- Nd|Nl|No| --N
- Pc|Pd|Ps|Pe|Pi|Pf|Po| --P
- Sm|Sc|Sk|So| --S
- Zs|Zl|Zp| --Z
- Cc|Cf|Cs|Co|Cn --C
- deriving (Show, Bounded, Enum, Read)
-
-data DecompType =
- DTCanonical | DTCompat | DTFont
- | DTNoBreak | DTInitial | DTMedial | DTFinal
- | DTIsolated | DTCircle | DTSuper | DTSub
- | DTVertical | DTWide | DTNarrow
- | DTSmall | DTSquare | DTFraction
- deriving (Show, Eq)
-
-data Decomp = DCSelf | DC [Char] deriving (Show, Eq)
-
--- data DType = Canonical | Kompat
-
-data DetailedChar =
- DetailedChar
- { _char :: Char
- , _name :: String
- , _generalCategory :: GeneralCategory
- , _combiningClass :: Int
- , _decompositionType :: Maybe DecompType
- , _decomposition :: Decomp
- , _simpleUppercaseMapping :: Maybe Char
- , _simpleLowercaseMapping :: Maybe Char
- , _simpleTitlecaseMapping :: Maybe Char
- }
- deriving (Show)
-
-{- [NOTE] Used by disabled generator
-
--- See: https://www.unicode.org/reports/tr44/#Default_Values_Table
-mkDefaultDetailedChar :: Char -> DetailedChar
-mkDefaultDetailedChar c = DetailedChar
- { _char = c
- , _name = mempty
- , _generalCategory = Cn
- , _combiningClass = 0
- , _decompositionType = Nothing
- , _decomposition = DCSelf
- , _simpleUppercaseMapping = Nothing
- , _simpleLowercaseMapping = Nothing
- , _simpleTitlecaseMapping = Nothing }
--}
-
--------------------------------------------------------------------------------
--- Helpers
--------------------------------------------------------------------------------
-
-headerRule :: String
-headerRule = "-----------------------------------------------------------------------------"
-
-mkModuleHeader :: String -> String
-mkModuleHeader modName =
- unlines
- [ headerRule
- , "-- |"
- , "-- Module : " <> modName
- , "-- Copyright : (c) 2020 Composewell Technologies and Contributors"
- , "-- License : BSD-3-Clause"
- -- [FIXME] Update maintainer
- , "-- Maintainer : streamly at composewell.com"
- , "-- Stability : internal"
- , headerRule
- ]
-
-readCodePoint :: String -> Char
-readCodePoint = chr . read . ("0x"<>)
-
-readCodePointM :: String -> Maybe Char
-readCodePointM "" = Nothing
-readCodePointM u = Just (readCodePoint u)
-
-genSignature :: String -> String
-genSignature = (<> " :: Char -> Bool")
-
--- | Check that var is between minimum and maximum of orderList
-genRangeCheck :: String -> [Int] -> String
-genRangeCheck var ordList =
- var
- <> " >= "
- <> show (minimum ordList)
- <> " && " <> var <> " <= " <> show (maximum ordList)
-
-genBitmap :: String -> [Int] -> String
-genBitmap funcName ordList =
- unlines
- [ "{-# INLINE " <> funcName <> " #-}"
- , genSignature funcName
- , funcName <> " = \\c -> let n = ord c in "
- <> genRangeCheck "n" ordList <> " && lookupBit64 bitmap# n"
- , " where"
- , " bitmap# = \"" <> bitMapToAddrLiteral (positionsToBitMap ordList) "\"#"
- ]
-
-positionsToBitMap :: [Int] -> [Bool]
-positionsToBitMap = go 0
-
- where
-
- go _ [] = []
- go i xxs@(x:xs)
- | i < x = False : go (i + 1) xxs
- | otherwise = True : go (i + 1) xs
-
-bitMapToAddrLiteral ::
- -- | Values to encode
- [Bool] ->
- -- | String to append
- String ->
- String
-bitMapToAddrLiteral bs cs = foldr encode cs (unfoldr mkChunks bs)
-
- where
-
- mkChunks :: [a] -> Maybe ([a], [a])
- mkChunks [] = Nothing
- mkChunks xs = Just $ splitAt 8 xs
-
- encode :: [Bool] -> String -> String
- encode chunk acc = '\\' : shows (toByte (padTo8 chunk)) acc
-
- padTo8 :: [Bool] -> [Bool]
- padTo8 xs
- | length xs >= 8 = xs
- | otherwise = xs <> replicate (8 - length xs) False
-
- toByte :: [Bool] -> Int
- toByte xs = sum $ map (\i -> if xs !! i then 1 `shiftL` i else 0) [0..7]
-
-genEnumBitmap ::
- forall a. (Bounded a, Enum a, Show a) =>
- -- | Function name
- String ->
- -- | Default value
- a ->
- -- | List of values to encode
- [a] ->
- String
-genEnumBitmap funcName def as = unlines
- [ "{-# INLINE " <> funcName <> " #-}"
- , funcName <> " :: Char -> Int"
- , funcName <> " c = let n = ord c in if n >= "
- <> show (length as)
- <> " then "
- <> show (fromEnum def)
- <> " else lookup_bitmap n"
-
- , "{-# NOINLINE lookup_bitmap #-}"
- , "lookup_bitmap :: Int -> Int"
- , "lookup_bitmap n = lookupIntN bitmap# n"
- , " where"
- , " bitmap# = \"" <> enumMapToAddrLiteral as "\"#"
- ]
-
-{-| Encode a list of values as a byte map, using their 'Enum' instance.
-
-__Note:__ 'Enum' instance must respect the following:
-
-* @fromEnum minBound >= 0x00@
-* @fromEnum maxBound <= 0xff@
--}
-enumMapToAddrLiteral ::
- forall a. (Bounded a, Enum a, Show a) =>
- -- | Values to encode
- [a] ->
- -- | String to append
- String ->
- String
-enumMapToAddrLiteral xs cs = foldr go cs xs
-
- where
-
- go :: a -> String -> String
- go x acc = '\\' : shows (toWord8 x) acc
-
- toWord8 :: a -> Word8
- toWord8 a = let w = fromEnum a in if 0 <= w && w <= 0xff
- then fromIntegral w
- else error $ "Cannot convert to Word8: " <> show a
-
-{- [NOTE] Disabled generator (normalization)
--- This bit of code is duplicated but this duplication allows us to reduce 2
--- dependencies on the executable.
-
-jamoLCount :: Int
-jamoLCount = 19
-
-jamoVCount :: Int
-jamoVCount = 21
-
-jamoTCount :: Int
-jamoTCount = 28
-
-hangulFirst :: Int
-hangulFirst = 0xac00
-
-hangulLast :: Int
-hangulLast = hangulFirst + jamoLCount * jamoVCount * jamoTCount - 1
-
-isHangul :: Char -> Bool
-isHangul c = n >= hangulFirst && n <= hangulLast
- where n = ord c
--}
-
-genUnicodeVersion :: FilePath -> IO ()
-genUnicodeVersion outdir = do
- version <- catch
- (getEnv "UNICODE_VERSION")
- (\(_ :: IOException) -> return "<unknown>")
- Stream.fold f (Stream.fromList (body version))
- where
- moduleName = "GHC.Internal.Unicode.Version"
- f = moduleFileEmitter Nothing outdir
- (moduleName, \_ -> Fold.foldMap (<> "\n"))
- body :: String -> [String]
- body version =
- [ "{-# LANGUAGE NoImplicitPrelude #-}"
- , "{-# OPTIONS_HADDOCK hide #-}"
- , ""
- , mkModuleHeader moduleName
- , "module " <> moduleName
- , "(unicodeVersion)"
- , "where"
- , ""
- , "import {-# SOURCE #-} GHC.Internal.Data.Version"
- , ""
- , "-- | Version of Unicode standard used by @base@:"
- , "-- [" <> version <> "](https://www.unicode.org/versions/Unicode" <> version <> "/)."
- , "--"
- , "-- @since base-4.15.0.0"
- , "unicodeVersion :: Version"
- , "unicodeVersion = makeVersion [" <> mkVersion version <> "]" ]
- mkVersion = foldr (\c acc -> case c of {'.' -> ',':' ':acc; _ -> c:acc}) mempty
-
--------------------------------------------------------------------------------
--- Parsers
--------------------------------------------------------------------------------
-
--------------------------------------------------------------------------------
--- Parsing UnicodeData.txt
--------------------------------------------------------------------------------
-
-genGeneralCategoryModule
- :: Monad m
- => String
- -> Fold m DetailedChar String
-genGeneralCategoryModule moduleName =
- done <$> Fold.foldl' step initial
-
- where
-
- -- (categories, expected char)
- initial = ([], '\0')
-
- step (acc, p) a = if p < _char a
- -- Fill missing char entry with default category Cn
- -- See: https://www.unicode.org/reports/tr44/#Default_Values_Table
- then step (Cn : acc, succ p) a
- -- Regular entry
- else (_generalCategory a : acc, succ (_char a))
-
- done (acc, _) = unlines
- [ "{-# LANGUAGE NoImplicitPrelude #-}"
- , "{-# LANGUAGE MagicHash #-}"
- , "{-# OPTIONS_HADDOCK hide #-}"
- , ""
- , mkModuleHeader moduleName
- , "module " <> moduleName
- , "(generalCategory)"
- , "where"
- , ""
- , "import GHC.Internal.Base (Char, Int, Ord(..), ord)"
- , "import GHC.Internal.Unicode.Bits (lookupIntN)"
- , ""
- , genEnumBitmap "generalCategory" Cn (reverse acc)
- ]
-
-readDecomp :: String -> (Maybe DecompType, Decomp)
-readDecomp s =
- if null wrds
- then (Nothing, DCSelf)
- else decmps wrds
-
- where
-
- decmps [] = error "Unreachable flow point"
- decmps y@(x:xs) =
- case dtmap x of
- DTCanonical -> (,) (Just DTCanonical) (readCP y)
- other -> (,) (Just other) (readCP xs)
-
- wrds = words s
-
- readCP ws = DC $ map readCodePoint ws
-
- dtmap "<compat>" = DTCompat
- dtmap "<circle>" = DTCircle
- dtmap "<final>" = DTFinal
- dtmap "<font>" = DTFont
- dtmap "<fraction>" = DTFraction
- dtmap "<initial>" = DTInitial
- dtmap "<isolated>" = DTIsolated
- dtmap "<medial>" = DTMedial
- dtmap "<narrow>" = DTNarrow
- dtmap "<noBreak>" = DTNoBreak
- dtmap "<small>" = DTSmall
- dtmap "<square>" = DTSquare
- dtmap "<sub>" = DTSub
- dtmap "<super>" = DTSuper
- dtmap "<vertical>" = DTVertical
- dtmap "<wide>" = DTWide
- dtmap _ = DTCanonical
-
-{- [NOTE] Disabled generators
-
-filterNonHangul :: Monad m => Fold m DetailedChar a -> Fold m DetailedChar a
-filterNonHangul = Fold.filter (not . isHangul . _char)
-
-filterDecomposableType ::
- Monad m => DType -> Fold m DetailedChar a -> Fold m DetailedChar a
-filterDecomposableType dtype =
- Fold.filter ((/= DCSelf) . _decomposition)
- . Fold.filter (predicate . _decompositionType)
-
- where
-
- predicate =
- case dtype of
- Canonical -> (== Just DTCanonical)
- Kompat -> const True
-
-genDecomposableModule ::
- Monad m => String -> DType -> Fold m DetailedChar String
-genDecomposableModule moduleName dtype =
- filterNonHangul
- $ filterDecomposableType dtype $ done <$> Fold.foldl' step initial
-
- where
-
- initial = []
-
- step st a = ord (_char a) : st
-
- done st =
- unlines
- [ "{-# LANGUAGE NoImplicitPrelude #-}"
- , "{-# LANGUAGE MagicHash #-}"
- , "{-# OPTIONS_HADDOCK hide #-}"
- , ""
- , mkModuleHeader moduleName
- , "module " <> moduleName
- , "(isDecomposable)"
- , "where"
- , ""
- , "import Data.Char (ord)"
- , "import GHC.Internal.Unicode.Bits (lookupBit64)"
- , ""
- , genBitmap "isDecomposable" (reverse st)
- ]
-
-genCombiningClassModule :: Monad m => String -> Fold m DetailedChar String
-genCombiningClassModule moduleName =
- Fold.filter (\dc -> _combiningClass dc /= 0)
- $ done <$> Fold.foldl' step initial
-
- where
-
- initial = ([], [])
-
- step (st1, st2) a = (genCombiningClassDef a : st1, ord (_char a) : st2)
-
- done (st1, st2) =
- unlines
- [ "{-# LANGUAGE NoImplicitPrelude #-}"
- , "{-# LANGUAGE MagicHash #-}"
- , "{-# OPTIONS_HADDOCK hide #-}"
- , ""
- , mkModuleHeader moduleName
- , "module " <> moduleName
- , "(combiningClass, isCombining)"
- , "where"
- , ""
- , "import Data.Char (ord)"
- , "import GHC.Internal.Unicode.Bits (lookupBit64)"
- , ""
- , "combiningClass :: Char -> Int"
- , unlines (reverse st1)
- , "combiningClass _ = 0\n"
- , ""
- , genBitmap "isCombining" (reverse st2)
- ]
-
- genCombiningClassDef dc =
- "combiningClass "
- <> show (_char dc) <> " = " <> show (_combiningClass dc)
-
-genDecomposeDefModule ::
- Monad m
- => String
- -> [String]
- -> [String]
- -> DType
- -> (Int -> Bool)
- -> Fold m DetailedChar String
-genDecomposeDefModule moduleName before after dtype pred =
- Fold.filter (pred . ord . _char)
- $ filterNonHangul
- $ filterDecomposableType dtype $ done <$> Fold.foldl' step initial
-
- where
-
- decomposeChar c DCSelf = [c]
- decomposeChar _c (DC ds) = ds
-
- genHeader =
- [ "{-# OPTIONS_GHC -fno-warn-incomplete-patterns #-}"
- , ""
- , mkModuleHeader moduleName
- , "module " <> moduleName
- , "(decompose)"
- , "where"
- , ""
- ]
- genSign =
- [ "-- Note: this is a partial function we do not expect to call"
- , "-- this if isDecomposable returns false."
- , "{-# NOINLINE decompose #-}"
- , "decompose :: Char -> [Char]"
- ]
- initial = []
-
- step st dc = genDecomposeDef dc : st
-
- done st =
- let body = mconcat [genHeader, before, genSign, reverse st, after]
- in unlines body
-
- genDecomposeDef dc =
- "decompose "
- <> show (_char dc)
- <> " = " <> show (decomposeChar (_char dc) (_decomposition dc))
-
-genCompositionsModule ::
- Monad m
- => String
- -> [Int]
- -> [Int]
- -> Fold m DetailedChar String
-genCompositionsModule moduleName compExclu non0CC =
- Fold.filter (not . flip elem compExclu . ord . _char)
- $ filterNonHangul
- $ Fold.filter (isDecompositionLen2 . _decomposition)
- $ filterDecomposableType Canonical $ done <$> Fold.foldl' step initial
-
- where
-
- isDecompositionLen2 DCSelf = False
- isDecompositionLen2 (DC ds) = length ds == 2
-
- genComposePairDef name dc =
- name
- <> " "
- <> show (head d01)
- <> " " <> show (d01 !! 1) <> " = Just " <> show (_char dc)
-
- where
-
- d01 = decompPair dc
-
- decompPair dc =
- case _decomposition dc of
- DCSelf -> error "toCompFormat: DCSelf"
- (DC ds) ->
- if length ds == 2
- then ds
- else error "toCompFormat: length /= 2"
-
- initial = ([], [], [])
-
- step (dec, sp, ss) dc = (dec1, sp1, ss1)
-
- where
-
- d01 = decompPair dc
- d1Ord = ord $ d01 !! 1
- dec1 = genComposePairDef "compose" dc : dec
- sp1 =
- if d1Ord `notElem` non0CC
- then genComposePairDef "composeStarters" dc : sp
- else sp
- ss1 =
- if d1Ord `notElem` non0CC
- then d1Ord : ss
- else ss
-
- header =
- [ "{-# LANGUAGE NoImplicitPrelude #-}"
- , "{-# LANGUAGE MagicHash #-}"
- , "{-# OPTIONS_GHC -fno-warn-incomplete-patterns #-}"
- , "{-# OPTIONS_HADDOCK hide #-}"
- , ""
- , mkModuleHeader moduleName
- , "module " <> moduleName
- , "(compose, composeStarters, isSecondStarter)"
- , "where"
- , ""
- , "import GHC.Internal.Base (Char, ord)"
- , "import GHC.Internal.Unicode.Bits (lookupBit64)"
- , ""
- ]
-
- composePair decomps =
- [ "{-# NOINLINE compose #-}"
- , "compose :: Char -> Char -> Maybe Char"
- , unlines decomps
- , "compose _ _ = " <> "Nothing" <> "\n"
- , ""
- ]
-
- composeStarterPair starterPairs =
- [ "composeStarters :: Char -> Char -> Maybe Char"
- , unlines starterPairs
- , "composeStarters _ _ = " <> "Nothing" <> "\n"
- , ""
- ]
-
- isSecondStarter secondStarters =
- [genBitmap "isSecondStarter" secondStarters]
-
- done (dec, sp, ss) =
- unlines
- $ header
- <> composePair (reverse dec)
- <> composeStarterPair (reverse sp)
- <> isSecondStarter (Set.toList (Set.fromList ss))
--}
-genSimpleCaseMappingModule
- :: Monad m
- => String
- -> String
- -> (DetailedChar -> Maybe Char)
- -> Fold m DetailedChar String
-genSimpleCaseMappingModule moduleName funcName field =
- done <$> Fold.foldl' step initial
-
- where
-
- genHeader =
- [ "{-# LANGUAGE NoImplicitPrelude, LambdaCase #-}"
- , "{-# OPTIONS_HADDOCK hide #-}"
- , ""
- , mkModuleHeader moduleName
- , "module " <> moduleName
- , "(" <> funcName <> ")"
- , "where"
- , ""
- , "import GHC.Internal.Base (Char)"
- , ""
- ]
- genSign =
- [ "{-# NOINLINE " <> funcName <> " #-}"
- , funcName <> " :: Char -> Char"
- , funcName <> " = \\case"
- ]
- initial = []
-
- step ds dc = case mkEntry dc of
- Nothing -> ds
- Just d -> d : ds
-
- after = [" c -> c"]
-
- done st =
- let body = mconcat [genHeader, genSign, reverse st, after]
- in unlines body
-
- mkEntry dc = field dc <&> \c -> mconcat
- [ " '\\x"
- , showHexChar (_char dc) "' -> '\\x"
- , showHexChar c "'"
- ]
-
- showHexChar c = showHex (ord c)
-
-genCorePropertiesModule ::
- Monad m => String -> (String -> Bool) -> Fold m (String, [Int]) String
-genCorePropertiesModule moduleName isProp =
- Fold.filter (\(name, _) -> isProp name) $ done <$> Fold.foldl' step initial
-
- where
-
- prop2FuncName x = "is" <> x
-
- initial = ([], [])
-
- step (props, bitmaps) (name, bits) =
- (name : props, genBitmap (prop2FuncName name) bits : bitmaps)
-
- done (props, bitmaps) = unlines $ header props <> bitmaps
-
- header exports =
- [ "{-# LANGUAGE NoImplicitPrelude #-}"
- , "{-# LANGUAGE MagicHash #-}"
- , "{-# OPTIONS_HADDOCK hide #-}"
- , ""
- , mkModuleHeader moduleName
- , "module " <> moduleName
- , "(" <> unwords (intersperse "," (map prop2FuncName exports)) <> ")"
- , "where"
- , ""
- , "import GHC.Internal.Base (Bool, Char, Ord(..), (&&), ord)"
- , "import GHC.Internal.Unicode.Bits (lookupBit64)"
- , ""
- ]
-
-{- [NOTE] Disabled generator
-genUnicode002TestResults :: Monad m => Fold m DetailedChar String
-genUnicode002TestResults = done <$> Fold.foldl' step initial
-
- where
-
- header = "Code C P S U L A D"
- -- (output, expected char)
- initial = ([], '\0')
- -- [TODO] Increase the number of tested char?
- -- maxChar = '\xF0000' -- First codepoint of the last private use areas.
- -- maxChar = '\xFFFF' -- Last codepoint of BMP.
- maxChar = chr 6553 -- Value in GHC 9.2.2 base test
-
- step (acc, c) dc = if c > maxChar
- then (acc, c)
- else if c < _char dc
- -- Fill missing char entry with default values
- -- See: https://www.unicode.org/reports/tr44/#Default_Values_Table
- then step (mkEntry (mkDefaultDetailedChar c) : acc, succ c) dc
- -- Regular entry
- else (mkEntry dc : acc, succ (_char dc))
-
- done (acc, _) = unlines (header : reverse acc)
-
- mkEntry dc = mconcat
- [ showCode (_char dc)
- -- [TODO] General category
- , showBool (isControl' dc)
- , showBool (isPrint' dc)
- , showBool (isSpace' dc)
- -- [TODO] isSeparator
- , showBool (isUpper' dc)
- , showBool (isLower' dc)
- , showBool (isAlpha' dc)
- -- [TODO] isAlphaNum
- , showBool (isDigit' dc)
- -- [TODO] isNumber
- -- [TODO] isMark
- -- [TODO] isPunctuation
- -- [TODO] isSymbol
- ]
-
- padding = length (show (ord maxChar))
- showCode c = take padding (shows (ord c) (repeat ' '))
- -- [TODO] use showHex
- -- showCode c =
- -- let code = showHex (ord c) mempty
- -- in replicate (padding - length code) '0' <> code
- showBool b = if b then " T" else " F"
-
- -- [NOTE] The following functions replicates Data.Char. Keep them up to date!
-
- isControl' dc = case _generalCategory dc of
- Cc -> True -- Control
- _ -> False
-
- isPrint' dc = case _generalCategory dc of
- Zl -> False -- LineSeparator
- Zp -> False -- ParagraphSeparator
- Cc -> False -- Control
- Cf -> False -- Format
- Cs -> False -- Surrogate
- Co -> False -- PrivateUse
- Cn -> False -- NotAssigned
- _ -> True
-
- isSpace' dc = case _char dc of
- '\t' -> True
- '\n' -> True
- '\v' -> True
- '\f' -> True
- '\r' -> True
- _ -> case _generalCategory dc of
- Zs -> True -- Space
- _ -> False
-
- isUpper' dc = case _generalCategory dc of
- Lu -> True -- UppercaseLetter
- Lt -> True -- TitlecaseLetter
- _ -> False
-
- isLower' dc = case _generalCategory dc of
- Ll -> True -- LowercaseLetter
- _ -> False
-
- isAlpha' dc = case _generalCategory dc of
- Lu -> True -- UppercaseLetter
- Ll -> True -- LowercaseLetter
- Lt -> True -- TitlecaseLetter
- Lm -> True -- ModifierLetter
- Lo -> True -- OtherLetter
- _ -> False
-
- isDigit' dc = let c = _char dc
- in (fromIntegral (ord c - ord '0') :: Word) <= 9
--}
-
--------------------------------------------------------------------------------
--- Parsing property files
--------------------------------------------------------------------------------
-
-type PropertyLine = (String, [Int])
-
-trim :: String -> String
-trim = takeWhile (not . isSpace) . dropWhile isSpace
-
-emptyPropertyLine :: PropertyLine
-emptyPropertyLine = ("", [])
-
-combinePropertyLines :: PropertyLine -> PropertyLine -> PropertyLine
-combinePropertyLines t1@(n1, o1) t2@(n2, o2)
- | n1 == "" = t2
- | n2 == "" = t1
- | n1 == n2 = (n1, o1 <> o2)
- | otherwise = error $ "Cannot group " <> n1 <> " with " <> n2
-
-parsePropertyLine :: String -> PropertyLine
-parsePropertyLine ln
- | null ln = emptyPropertyLine
- | head ln == '#' = emptyPropertyLine
- | otherwise = parseLineJ ln
-
- where
-
- parseLineJ :: String -> (String, [Int])
- parseLineJ line =
- let (rangeLn, line1) = span (/= ';') line
- propLn = takeWhile (/= '#') (tail line1)
- in (trim propLn, parseRange (trim rangeLn))
-
- parseRange :: String -> [Int]
- parseRange rng =
- if '.' `elem` rng
- then let low = read $ "0x" <> takeWhile (/= '.') rng
- high =
- read $ "0x" <> reverse (takeWhile (/= '.') (reverse rng))
- in [low .. high]
- else [read $ "0x" <> rng]
-
-isDivider :: String -> Bool
-isDivider x = x == "# ================================================"
-
-parsePropertyLines :: (Monad m) => Stream m String -> Stream m PropertyLine
-parsePropertyLines =
- Stream.splitOn isDivider
- $ Fold.lmap parsePropertyLine
- $ Fold.foldl' combinePropertyLines emptyPropertyLine
-
--- | A range entry in @UnicodeData.txt at .
-data UnicodeDataRange
- = SingleCode !DetailedChar
- -- ^ Regular entry for one code point
- | FirstCode !String !DetailedChar
- -- ^ A partial range for entry with a name as: @\<RANGE_IDENTIFIER, First\>@
- | CompleteRange !String !DetailedChar !DetailedChar
- -- ^ A complete range, requiring 2 continuous entries with respective names:
- --
- -- * @\<RANGE_IDENTIFIER, First\>@
- -- * @\<RANGE_IDENTIFIER, Last\>@
-
-{-| Parse UnicodeData.txt lines
-
-Parse ranges according to https://www.unicode.org/reports/tr44/#Code_Point_Ranges.
-
-__Note:__ this does /not/ fill missing char entries,
-i.e. entries with no explicit entry nor within a range.
--}
-parseUnicodeDataLines :: forall m. (Monad m) => Stream m String -> Stream m DetailedChar
-parseUnicodeDataLines
- = Stream.unfoldMany (Unfold.unfoldr unitToRange)
- . Stream.foldMany ( Fold.lmap parseDetailedChar
- $ Fold.foldt' step initial id)
-
- where
-
- step :: Maybe UnicodeDataRange
- -> DetailedChar
- -> Fold.Step (Maybe UnicodeDataRange) (Maybe UnicodeDataRange)
- step Nothing dc = case span (/= ',') (_name dc) of
- (range, ", First>") -> Fold.Partial (Just (FirstCode range dc))
- _ -> Fold.Done (Just (SingleCode dc))
- step (Just (FirstCode range1 dc1)) dc2 = case span (/= ',') (_name dc2) of
- (range2, ", Last>") -> if range1 == range2 && _char dc1 < _char dc2
- then Fold.Done (Just (CompleteRange range1 dc1 dc2))
- else error $ "Cannot create range: incompatible ranges" <> show (dc1, dc2)
- _ -> error $ "Cannot create range: missing <range, Last> entry correspong to: " <> show range1
- step _ _ = error "impossible case"
-
- initial :: Fold.Step (Maybe UnicodeDataRange) (Maybe UnicodeDataRange)
- initial = Fold.Partial Nothing
-
- unitToRange :: Maybe UnicodeDataRange -> Maybe (DetailedChar, Maybe UnicodeDataRange)
- unitToRange = fmap $ \case
- SingleCode dc -> (dc, Nothing)
- FirstCode _ dc -> error $ "Incomplete range: " <> show dc
- CompleteRange range dc1 dc2 -> if _char dc1 < _char dc2
- -- [TODO] Create the proper name
- then (dc1{_name="TODO"}, Just (CompleteRange range dc1{_char=succ (_char dc1)} dc2))
- else (dc2{_name="TODO"}, Nothing)
-
--- | Parse a single entry of @UnicodeData.txt@
-parseDetailedChar :: String -> DetailedChar
-parseDetailedChar line = case splitWhen (== ';') line of
- char
- :name
- :gc
- :combining
- :_bidi
- :decomposition
- :_decimal
- :_digit
- :_numeric
- :_bidiM
- :_uni1Name
- :_iso
- :sUpper
- :sLower
- :sTitle
- :_ ->
- let (dctype, dcval) = readDecomp decomposition
- in DetailedChar
- { _char = readCodePoint char
- , _name = name
- , _generalCategory = read gc
- , _combiningClass = read combining
- , _decompositionType = dctype
- , _decomposition = dcval
- , _simpleUppercaseMapping = readCodePointM sUpper
- , _simpleLowercaseMapping = readCodePointM sLower
- , _simpleTitlecaseMapping = readCodePointM sTitle
- }
- _ -> error ("Unsupported line: " <> line)
-
--------------------------------------------------------------------------------
--- Generation
--------------------------------------------------------------------------------
-
-readLinesFromFile :: String -> Stream IO String
-readLinesFromFile file =
- withFile file Sys.ReadMode
- $ \h -> Handle.read h & Unicode.decodeUtf8 & Unicode.lines Fold.toList
-
- where
- withFile file_ mode =
- Stream.bracketIO (Sys.openFile file_ mode) (Sys.hClose)
-
-
-moduleToFileName :: String -> String
-moduleToFileName = map (\x -> if x == '.' then '/' else x)
-
-dirFromFileName :: String -> String
-dirFromFileName = reverse . dropWhile (/= '/') . reverse
-
-data FileRecipe a
- = ModuleRecipe
- -- ^ A recipe to create a Haskell module file.
- String
- -- ^ Module name
- (String -> Fold IO a String)
- -- ^ Function that generate the module, given the module name.
- | TestOutputRecipe
- -- ^ A recipe to create a test output file.
- String
- -- ^ Test name
- (Fold IO a String)
- -- ^ Test output generator
-
--- ModuleRecipe is a tuple of the module name and a function that generates the
--- module using the module name
-type ModuleRecipe a = (String, String -> Fold IO a String)
-type TestOutputRecipe a = (FilePath, Fold IO a String)
-
--- GeneratorRecipe is a list of ModuleRecipe
-type GeneratorRecipe a = [FileRecipe a]
-
-moduleFileEmitter :: Maybe FilePath -> FilePath -> ModuleRecipe a -> Fold IO a ()
-moduleFileEmitter mfile outdir (modName, fldGen) = Fold.rmapM action $ fldGen modName
-
- where
-
- pretext version = case mfile of
- Just file -> mconcat
- [ "-- DO NOT EDIT: This file is automatically generated by the internal tool ucd2haskell,\n"
- , "-- with data from: https://www.unicode.org/Public/"
- , version
- , "/ucd/"
- , file
- ,".\n\n"
- ]
- Nothing -> "-- DO NOT EDIT: This file is automatically generated by the internal tool ucd2haskell.\n\n"
- outfile = outdir </> moduleToFileName modName <.> ".hs"
- outfiledir = dirFromFileName outfile
- action c = do
- version <-
- catch
- (getEnv "UNICODE_VERSION")
- (\(_ :: IOException) -> return "<unknown>")
- createDirectoryIfMissing True outfiledir
- writeFile outfile (pretext version <> c)
-
-testOutputFileEmitter :: FilePath -> TestOutputRecipe a -> Fold IO a ()
-testOutputFileEmitter outdir (name, fldGen) = Fold.rmapM action fldGen
-
- where
-
- outfile = outdir </> "tests" </> name <.> ".stdout"
- outfiledir = dirFromFileName outfile
- action c
- = createDirectoryIfMissing True outfiledir
- *> writeFile outfile c
-
-runGenerator ::
- FilePath
- -> FilePath
- -> (Stream IO String -> Stream IO a)
- -> FilePath
- -> GeneratorRecipe a
- -> IO ()
-runGenerator indir file transformLines outdir recipes =
- readLinesFromFile (indir <> file) & transformLines & Stream.fold combinedFld
-
- where
-
- generatedFolds = recipes <&> \case
- ModuleRecipe name f -> moduleFileEmitter (Just file) outdir (name, f)
- TestOutputRecipe name f -> testOutputFileEmitter outdir (name, f)
- combinedFld = void $ Fold.distribute generatedFolds
-
-genModules :: String -> String -> [String] -> IO ()
-genModules indir outdir props = do
- genUnicodeVersion outdir
-
- -- [NOTE] Disabled generator
- -- compExclu <-
- -- readLinesFromFile (indir <> "DerivedNormalizationProps.txt")
- -- & parsePropertyLines
- -- & Stream.find (\(name, _) -> name == "Full_Composition_Exclusion")
- -- & fmap (snd . fromMaybe ("", []))
-
- -- [NOTE] Disabled generator
- -- non0CC <-
- -- readLinesFromFile (indir <> "extracted/DerivedCombiningClass.txt")
- -- & parsePropertyLines
- -- & Stream.filter (\(name, _) -> name /= "0")
- -- & Stream.map snd
- -- & Stream.fold (Fold.foldl' (<>) [])
-
- runGenerator
- indir
- "UnicodeData.txt"
- parseUnicodeDataLines
- outdir
- -- [NOTE] Disabled generators
- -- [ uncurry ModuleRecipe compositions compExclu non0CC
- -- , uncurry ModuleRecipe combiningClass
- -- , uncurry ModuleRecipe decomposable
- -- , uncurry ModuleRecipe decomposableK
- -- , uncurry ModuleRecipe decompositions
- -- , uncurry ModuleRecipe decompositionsK2
- -- , uncurry ModuleRecipe decompositionsK
- [ uncurry ModuleRecipe generalCategory
- , uncurry ModuleRecipe simpleUpperCaseMapping
- , uncurry ModuleRecipe simpleLowerCaseMapping
- , uncurry ModuleRecipe simpleTitleCaseMapping
- -- , uncurry TestOutputRecipe unicode002Test
- ]
-
- -- [NOTE] Disabled generator
- -- runGenerator
- -- indir
- -- "PropList.txt"
- -- parsePropertyLines
- -- outdir
- -- [ uncurry ModuleRecipe propList ]
-
- runGenerator
- indir
- "DerivedCoreProperties.txt"
- parsePropertyLines
- outdir
- [ uncurry ModuleRecipe derivedCoreProperties ]
-
- where
-
- -- [NOTE] Disabled generator
- -- propList =
- -- ("GHC.Internal.Unicode.Char.PropList"
- -- , (`genCorePropertiesModule` (`elem` props)))
-
- derivedCoreProperties =
- ("GHC.Internal.Unicode.Char.DerivedCoreProperties"
- , (`genCorePropertiesModule` (`elem` props)))
-
- -- [NOTE] Disabled generator
- -- compositions exc non0 =
- -- ( "GHC.Internal.Unicode.Char.UnicodeData.Compositions"
- -- , \m -> genCompositionsModule m exc non0)
-
- -- [NOTE] Disabled generator
- -- combiningClass =
- -- ( "GHC.Internal.Unicode.Char.UnicodeData.CombiningClass"
- -- , genCombiningClassModule)
-
- -- [NOTE] Disabled generator
- -- decomposable =
- -- ( "GHC.Internal.Unicode.Char.UnicodeData.Decomposable"
- -- , (`genDecomposableModule` Canonical))
-
- -- [NOTE] Disabled generator
- -- decomposableK =
- -- ( "GHC.Internal.Unicode.Char.UnicodeData.DecomposableK"
- -- , (`genDecomposableModule` Kompat))
-
- -- [NOTE] Disabled generator
- -- decompositions =
- -- ( "GHC.Internal.Unicode.Char.UnicodeData.Decompositions"
- -- , \m -> genDecomposeDefModule m [] [] Canonical (const True))
-
- -- [NOTE] Disabled generator
- -- decompositionsK2 =
- -- ( "GHC.Internal.Unicode.Char.UnicodeData.DecompositionsK2"
- -- , \m -> genDecomposeDefModule m [] [] Kompat (>= 60000))
-
- -- [NOTE] Disabled generator
- -- decompositionsK =
- -- let pre = ["import qualified " <> fst decompositionsK2 <> " as DK2", ""]
- -- post = ["decompose c = DK2.decompose c"]
- -- in ( "GHC.Internal.Unicode.Char.UnicodeData.DecompositionsK"
- -- , \m -> genDecomposeDefModule m pre post Kompat (< 60000))
-
- generalCategory =
- ( "GHC.Internal.Unicode.Char.UnicodeData.GeneralCategory"
- , genGeneralCategoryModule)
-
- simpleUpperCaseMapping =
- ( "GHC.Internal.Unicode.Char.UnicodeData.SimpleUpperCaseMapping"
- , \m -> genSimpleCaseMappingModule m "toSimpleUpperCase" _simpleUppercaseMapping)
-
- simpleLowerCaseMapping =
- ( "GHC.Internal.Unicode.Char.UnicodeData.SimpleLowerCaseMapping"
- , \m -> genSimpleCaseMappingModule m "toSimpleLowerCase" _simpleLowercaseMapping)
-
- simpleTitleCaseMapping =
- ( "GHC.Internal.Unicode.Char.UnicodeData.SimpleTitleCaseMapping"
- , \m -> genSimpleCaseMappingModule m "toSimpleTitleCase" _simpleTitlecaseMapping)
-
- -- unicode002Test =
- -- ( "unicode002"
- -- , genUnicode002TestResults)
=====================================
libraries/ghc-internal/tools/ucd2haskell/exe/UCD2Haskell.hs
=====================================
@@ -2,14 +2,17 @@
-- Module : Main
-- Copyright : (c) 2020 Composewell Technologies and Contributors
-- License : BSD-3-Clause
--- Maintainer : streamly at composewell.com
+-- Maintainer : The GHC Developers <ghc-devs at haskell.org>"
-- Stability : internal
--
module Main where
-import WithCli (HasArguments(..), withCli)
-import Parser.Text (genModules)
+import qualified Data.ByteString.Char8 as B8
+import qualified Data.ByteString.Short as BS
import GHC.Generics (Generic)
+import WithCli (HasArguments(..), withCli)
+
+import UCD2Haskell.ModuleGenerators (genModules)
data CLIOptions =
CLIOptions
@@ -20,7 +23,10 @@ data CLIOptions =
deriving (Show, Generic, HasArguments)
cliClient :: CLIOptions -> IO ()
-cliClient opts = genModules (input opts) (output opts) (core_prop opts)
+cliClient opts = genModules
+ opts.input
+ opts.output
+ (BS.toShort . B8.pack <$> opts.core_prop)
main :: IO ()
main = withCli cliClient
=====================================
libraries/ghc-internal/tools/ucd2haskell/exe/UCD2Haskell/ModuleGenerators.hs
=====================================
@@ -0,0 +1,517 @@
+-- |
+-- Module : UCD2Haskell.ModuleGenerators
+-- Copyright : (c) 2020 Composewell Technologies and Contributors
+-- (c) 2016-2017 Harendra Kumar
+-- (c) 2014-2015 Antonio Nikishaev
+-- (c) 2022-2024 Pierre Le Marre
+-- License : BSD-3-Clause
+-- Maintainer : The GHC Developers <ghc-devs at haskell.org>"
+-- Stability : internal
+
+-- Code history:
+--
+-- This code was adapted from https://github.com/composewell/unicode-data/
+-- (around commit c4aa52ed932ad8badf97296858932c3389b275b8) by Pierre Le Marre.
+-- The original Unicode database parser was taken from
+-- https://github.com/composewell/unicode-transforms but was completely
+-- rewritten from scratch to parse from UCD text files instead of XML, only
+-- some types remain the same. That code in turn was originally taken from
+-- https://github.com/llelf/prose (Antonio Nikishaev) and heavily modified by
+-- Harendra Kumar.
+--
+module UCD2Haskell.ModuleGenerators (genModules) where
+
+import Control.Exception (catch, IOException)
+import Data.Bits (Bits(..))
+import Data.Word (Word8)
+import Data.Char (ord)
+import Data.Functor ((<&>), ($>))
+import Data.List (intersperse, unfoldr)
+import System.Directory (createDirectoryIfMissing)
+import System.Environment (getEnv)
+import System.FilePath ((</>), (<.>))
+import Data.String (IsString)
+import Data.Foldable (Foldable(..))
+
+import qualified Data.ByteString.Builder as BB
+import qualified Data.ByteString as B
+import qualified Data.ByteString.Lazy as BL
+import qualified Data.ByteString.Short as BS
+
+import qualified Unicode.CharacterDatabase.Parser.Common as C
+import qualified Unicode.CharacterDatabase.Parser.UnicodeData as UD
+import qualified Unicode.CharacterDatabase.Parser.Properties.Multiple as P
+
+import Prelude hiding (pred)
+
+--------------------------------------------------------------------------------
+-- Helpers
+--------------------------------------------------------------------------------
+
+unlinesBB :: [BB.Builder] -> BB.Builder
+unlinesBB = (<> "\n") . mconcat . intersperse "\n"
+
+unwordsBB :: [BB.Builder] -> BB.Builder
+unwordsBB = mconcat . intersperse " "
+
+headerRule :: BB.Builder
+headerRule = "-----------------------------------------------------------------------------"
+
+mkModuleHeader :: BB.Builder -> BB.Builder
+mkModuleHeader modName =
+ unlinesBB
+ [ headerRule
+ , "-- |"
+ , "-- Module : " <> modName
+ , "-- License : BSD-3-Clause"
+ , "-- Maintainer : The GHC Developers <ghc-devs at haskell.org>"
+ , "-- Stability : internal"
+ , headerRule
+ ]
+
+genSignature :: BB.Builder -> BB.Builder
+genSignature = (<> " :: Char -> Bool")
+
+-- | Check that var is between minimum and maximum of orderList
+genRangeCheck :: BB.Builder -> [Int] -> BB.Builder
+genRangeCheck var ordList =
+ var
+ <> " >= "
+ <> BB.intDec (minimum ordList)
+ <> " && " <> var <> " <= " <> BB.intDec (maximum ordList)
+
+genBitmap :: BB.Builder -> [Int] -> BB.Builder
+genBitmap funcName ordList =
+ unlinesBB
+ [ "{-# INLINE " <> funcName <> " #-}"
+ , genSignature funcName
+ , funcName <> " = \\c -> let n = ord c in "
+ <> genRangeCheck "n" ordList <> " && lookupBit64 bitmap# n"
+ , " where"
+ , " bitmap# = \"" <> bitMapToAddrLiteral (positionsToBitMap ordList) "\"#"
+ ]
+
+positionsToBitMap :: [Int] -> [Bool]
+positionsToBitMap = go 0
+
+ where
+
+ go _ [] = []
+ go i xxs@(x:xs)
+ | i < x = False : go (i + 1) xxs
+ | otherwise = True : go (i + 1) xs
+
+bitMapToAddrLiteral ::
+ -- | Values to encode
+ [Bool] ->
+ -- | String to append
+ BB.Builder ->
+ BB.Builder
+bitMapToAddrLiteral bs cs = foldr encode cs (unfoldr mkChunks bs)
+
+ where
+
+ mkChunks :: [a] -> Maybe ([a], [a])
+ mkChunks [] = Nothing
+ mkChunks xs = Just $ splitAt 8 xs
+
+ encode :: [Bool] -> BB.Builder -> BB.Builder
+ encode chunk acc = BB.char7 '\\' <> BB.intDec (toByte (padTo8 chunk)) <> acc
+
+ padTo8 :: [Bool] -> [Bool]
+ padTo8 xs
+ | length xs >= 8 = xs
+ | otherwise = xs <> replicate (8 - length xs) False
+
+ toByte :: [Bool] -> Int
+ toByte xs = sum $ map (\i -> if xs !! i then 1 `shiftL` i else 0) [0..7]
+
+genEnumBitmap ::
+ forall a. (Bounded a, Enum a, Show a) =>
+ -- | Function name
+ BB.Builder ->
+ -- | Default value
+ a ->
+ -- | List of values to encode
+ [a] ->
+ BB.Builder
+genEnumBitmap funcName def as = unlinesBB
+ [ "{-# INLINE " <> funcName <> " #-}"
+ , funcName <> " :: Char -> Int"
+ , funcName <> " c = let n = ord c in if n >= "
+ <> BB.intDec (length as)
+ <> " then "
+ <> BB.intDec (fromEnum def)
+ <> " else lookup_bitmap n"
+
+ , "{-# NOINLINE lookup_bitmap #-}"
+ , "lookup_bitmap :: Int -> Int"
+ , "lookup_bitmap n = lookupIntN bitmap# n"
+ , " where"
+ , " bitmap# = \"" <> enumMapToAddrLiteral as "\"#"
+ ]
+
+{-| Encode a list of values as a byte map, using their 'Enum' instance.
+
+__Note:__ 'Enum' instance must respect the following:
+
+* @fromEnum minBound >= 0x00@
+* @fromEnum maxBound <= 0xff@
+-}
+enumMapToAddrLiteral ::
+ forall a. (Bounded a, Enum a, Show a) =>
+ -- | Values to encode
+ [a] ->
+ -- | String to append
+ BB.Builder ->
+ BB.Builder
+enumMapToAddrLiteral xs cs = foldr go cs xs
+
+ where
+
+ go :: a -> BB.Builder -> BB.Builder
+ go x acc = BB.char7 '\\' <> BB.word8Dec (toWord8 x) <> acc
+
+ toWord8 :: a -> Word8
+ toWord8 a = let w = fromEnum a in if 0 <= w && w <= 0xff
+ then fromIntegral w
+ else error $ "Cannot convert to Word8: " <> show a
+
+genUnicodeVersion :: FilePath -> IO ()
+genUnicodeVersion outdir = do
+ version <- catch
+ (getEnv "UNICODE_VERSION")
+ (\(_ :: IOException) -> return "<unknown>")
+ runFold f [body version]
+ where
+ moduleName :: (IsString a) => a
+ moduleName = "GHC.Internal.Unicode.Version"
+ f = moduleFileEmitter Nothing outdir
+ (moduleName, \_ -> Fold (\_ x -> x) mempty id)
+ body :: String -> BB.Builder
+ body version = unlinesBB
+ [ "{-# LANGUAGE NoImplicitPrelude #-}"
+ , "{-# OPTIONS_HADDOCK hide #-}"
+ , ""
+ , mkModuleHeader moduleName
+ , "module " <> moduleName
+ , "(unicodeVersion)"
+ , "where"
+ , ""
+ , "import {-# SOURCE #-} GHC.Internal.Data.Version"
+ , ""
+ , "-- | Version of Unicode standard used by @base@:"
+ , "-- [" <> BB.string7 version <> "](https://www.unicode.org/versions/Unicode" <> BB.string7 version <> "/)."
+ , "--"
+ , "-- @since base-4.15.0.0"
+ , "unicodeVersion :: Version"
+ , "unicodeVersion = makeVersion [" <> mkVersion version <> "]" ]
+ mkVersion = foldMap (\c -> case c of {'.' -> BB.char7 ',' <> BB.char7 ' '; _ -> BB.char7 c})
+
+--------------------------------------------------------------------------------
+-- Fold
+--------------------------------------------------------------------------------
+
+data Fold a b = forall s. Fold
+ { _step :: s -> a -> s
+ , _initial :: s
+ , _final :: s -> b }
+
+data Pair a b = Pair !a !b
+
+teeWith :: (a -> b -> c) -> Fold x a -> Fold x b -> Fold x c
+teeWith f (Fold stepL initialL finalL) (Fold stepR initialR finalR) =
+ Fold step initial final
+ where
+ step (Pair sL sR) x = Pair (stepL sL x) (stepR sR x)
+ initial = Pair initialL initialR
+ final (Pair sL sR) = f (finalL sL) (finalR sR)
+
+distribute :: [Fold a b] -> Fold a [b]
+distribute = foldr (teeWith (:)) (Fold const () (const []))
+
+rmapFold :: (b -> c) -> Fold a b -> Fold a c
+rmapFold f (Fold step initial final) = Fold step initial (f . final)
+
+runFold :: Fold a b -> [a] -> b
+runFold (Fold step initial final) = final . foldl' step initial
+
+--------------------------------------------------------------------------------
+-- Modules generators
+--------------------------------------------------------------------------------
+
+data GeneralCategoryAcc = GeneralCategoryAcc
+ { _categories :: ![UD.GeneralCategory]
+ , _expectedChar :: !Char
+ }
+
+genGeneralCategoryModule :: BB.Builder -> Fold UD.Entry BB.Builder
+genGeneralCategoryModule moduleName = Fold step initial done
+
+ where
+
+ -- (categories, expected char)
+ initial = GeneralCategoryAcc [] '\0'
+
+ step (GeneralCategoryAcc acc p) e@(UD.Entry r d)
+ | p < r.start
+ -- Fill missing char entry with default category Cn
+ -- See: https://www.unicode.org/reports/tr44/#Default_Values_Table
+ = step (GeneralCategoryAcc (replicate (ord r.start - ord p) UD.Cn <> acc) r.start) e
+ -- Regular entry
+ | otherwise = case r of
+ C.SingleChar ch -> GeneralCategoryAcc
+ (d.generalCategory : acc)
+ (succ ch)
+ C.CharRange ch1 ch2 -> GeneralCategoryAcc
+ (replicate (ord ch2 - ord ch1 + 1) d.generalCategory <> acc)
+ (succ ch2)
+
+ done (GeneralCategoryAcc acc _) = unlinesBB
+ [ "{-# LANGUAGE NoImplicitPrelude #-}"
+ , "{-# LANGUAGE MagicHash #-}"
+ , "{-# OPTIONS_HADDOCK hide #-}"
+ , ""
+ , mkModuleHeader moduleName
+ , "module " <> moduleName
+ , "(generalCategory)"
+ , "where"
+ , ""
+ , "import GHC.Internal.Base (Char, Int, Ord(..), ord)"
+ , "import GHC.Internal.Unicode.Bits (lookupIntN)"
+ , ""
+ , genEnumBitmap "generalCategory" UD.Cn (reverse acc)
+ ]
+
+genSimpleCaseMappingModule
+ :: BB.Builder
+ -> BB.Builder
+ -> (UD.CharDetails -> Maybe Char)
+ -> Fold UD.Entry BB.Builder
+genSimpleCaseMappingModule moduleName funcName field =
+ Fold step initial done
+
+ where
+
+ genHeader =
+ [ "{-# LANGUAGE NoImplicitPrelude, LambdaCase #-}"
+ , "{-# OPTIONS_HADDOCK hide #-}"
+ , ""
+ , mkModuleHeader moduleName
+ , "module " <> moduleName
+ , "(" <> funcName <> ")"
+ , "where"
+ , ""
+ , "import GHC.Internal.Base (Char)"
+ , ""
+ ]
+ genSign =
+ [ "{-# NOINLINE " <> funcName <> " #-}"
+ , funcName <> " :: Char -> Char"
+ , funcName <> " = \\case"
+ ]
+ initial = []
+
+ step ds dc = case mkEntry dc of
+ Nothing -> ds
+ Just d -> d : ds
+
+ after = [" c -> c"]
+
+ done st =
+ let body = mconcat [genHeader, genSign, reverse st, after]
+ in unlinesBB body
+
+ mkEntry (UD.Entry r dc) = case r of
+ C.SingleChar ch -> field dc <&> \c -> mconcat
+ [ " '\\x"
+ , showHexChar ch
+ , "' -> '\\x"
+ , showHexChar c
+ , "'"
+ ]
+ C.CharRange{} -> field dc $> error ("genSimpleCaseMappingModule: unexpected char range: " <> show r)
+
+ showHexChar c = BB.wordHex (fromIntegral (ord c))
+
+data PropertiesAcc = PropertiesAcc
+ { _properties :: ![BS.ShortByteString]
+ , _bitmaps :: ![BB.Builder]
+ , _currentBitmap :: ![[Int]] }
+
+genCorePropertiesModule ::
+ BB.Builder -> (BS.ShortByteString -> Bool) -> Fold P.Entry BB.Builder
+genCorePropertiesModule moduleName isProp = Fold step initial done
+ where
+ prop2FuncName x = "is" <> BB.shortByteString x
+
+ initial = PropertiesAcc [] [] []
+
+ step acc@(PropertiesAcc props bitmaps bits) P.Entry{..}
+ | not (isProp property) = acc -- property filtered out
+ | otherwise = case props of
+ prop' : _
+ | prop' == property -> PropertiesAcc props bitmaps (rangeToBits range : bits)
+ | otherwise -> PropertiesAcc
+ { _properties = property : props
+ , _bitmaps = genBitmap' prop' bits : bitmaps
+ , _currentBitmap = [rangeToBits range] }
+ _ -> PropertiesAcc [property] bitmaps [rangeToBits range]
+
+ rangeToBits = \case
+ C.SingleChar ch -> [ord ch]
+ C.CharRange ch1 ch2 -> [ord ch1 .. ord ch2]
+
+ genBitmap' prop bits = genBitmap (prop2FuncName prop) (mconcat (reverse bits))
+
+ done (PropertiesAcc props bitmaps bits) = unlinesBB (header props <> bitmaps')
+ where
+ lastProp = case props of
+ prop : _ -> prop
+ [] -> error "impossible"
+ bitmaps' = genBitmap' lastProp bits : bitmaps
+
+ header exports =
+ [ "{-# LANGUAGE NoImplicitPrelude #-}"
+ , "{-# LANGUAGE MagicHash #-}"
+ , "{-# OPTIONS_HADDOCK hide #-}"
+ , ""
+ , mkModuleHeader moduleName
+ , "module " <> moduleName
+ , "(" <> unwordsBB (intersperse "," (map prop2FuncName exports)) <> ")"
+ , "where"
+ , ""
+ , "import GHC.Internal.Base (Bool, Char, Ord(..), (&&), ord)"
+ , "import GHC.Internal.Unicode.Bits (lookupBit64)"
+ , ""
+ ]
+
+--------------------------------------------------------------------------------
+-- Generation
+--------------------------------------------------------------------------------
+
+moduleToFileName :: String -> String
+moduleToFileName = map (\x -> if x == '.' then '/' else x)
+
+dirFromFileName :: String -> String
+dirFromFileName = reverse . dropWhile (/= '/') . reverse
+
+data FileRecipe a
+ = ModuleRecipe
+ -- ^ A recipe to create a Haskell module file.
+ String
+ -- ^ Module name
+ (BB.Builder -> Fold a BB.Builder)
+ -- ^ Function that generate the module, given the module name.
+ | TestOutputRecipe
+ -- ^ A recipe to create a test output file.
+ String
+ -- ^ Test name
+ (Fold a BB.Builder)
+ -- ^ Test output generator
+
+-- ModuleRecipe is a tuple of the module name and a function that generates the
+-- module using the module name
+type ModuleRecipe a = (String, BB.Builder -> Fold a BB.Builder)
+type TestOutputRecipe a = (FilePath, Fold a BB.Builder)
+
+-- GeneratorRecipe is a list of ModuleRecipe
+type GeneratorRecipe a = [FileRecipe a]
+
+moduleFileEmitter :: Maybe FilePath -> FilePath -> ModuleRecipe a -> Fold a (IO ())
+moduleFileEmitter mfile outdir (modName, fldGen) = rmapFold action $ fldGen (BB.string7 modName)
+
+ where
+
+ pretext version = case mfile of
+ Just file -> mconcat
+ [ "-- DO NOT EDIT: This file is automatically generated by the internal tool ucd2haskell,\n"
+ , "-- with data from: https://www.unicode.org/Public/"
+ , BB.string7 version
+ , "/ucd/"
+ , BB.string7 file
+ ,".\n\n"
+ ]
+ Nothing -> "-- DO NOT EDIT: This file is automatically generated by the internal tool ucd2haskell.\n\n"
+ outfile = outdir </> moduleToFileName modName <.> ".hs"
+ outfiledir = dirFromFileName outfile
+ action c = do
+ version <-
+ catch
+ (getEnv "UNICODE_VERSION")
+ (\(_ :: IOException) -> return "<unknown>")
+ createDirectoryIfMissing True outfiledir
+ B.writeFile outfile (BL.toStrict (BB.toLazyByteString (pretext version <> c)))
+
+testOutputFileEmitter :: FilePath -> TestOutputRecipe a -> Fold a (IO ())
+testOutputFileEmitter outdir (name, fldGen) = rmapFold action fldGen
+
+ where
+
+ outfile = outdir </> "tests" </> name <.> ".stdout"
+ outfiledir = dirFromFileName outfile
+ action c
+ = createDirectoryIfMissing True outfiledir
+ *> B.writeFile outfile (BL.toStrict (BB.toLazyByteString c))
+
+runGenerator ::
+ FilePath
+ -> FilePath
+ -> (B.ByteString -> [a])
+ -> FilePath
+ -> GeneratorRecipe a
+ -> IO ()
+runGenerator indir file transformLines outdir recipes = do
+ raw <- B.readFile (indir <> file)
+ sequence_ (runFold combinedFld (transformLines raw))
+
+ where
+
+ generatedFolds = recipes <&> \case
+ ModuleRecipe name f -> moduleFileEmitter (Just file) outdir (name, f)
+ TestOutputRecipe name f -> testOutputFileEmitter outdir (name, f)
+ combinedFld = distribute generatedFolds
+
+genModules :: FilePath -> FilePath -> [BS.ShortByteString] -> IO ()
+genModules indir outdir props = do
+ genUnicodeVersion outdir
+
+ runGenerator
+ indir
+ "UnicodeData.txt"
+ UD.parse
+ outdir
+ [ generalCategory
+ , simpleUpperCaseMapping
+ , simpleLowerCaseMapping
+ , simpleTitleCaseMapping
+ ]
+
+ runGenerator
+ indir
+ "DerivedCoreProperties.txt"
+ P.parse
+ outdir
+ [ derivedCoreProperties ]
+
+ where
+
+ derivedCoreProperties = ModuleRecipe
+ "GHC.Internal.Unicode.Char.DerivedCoreProperties"
+ (`genCorePropertiesModule` (`elem` props))
+
+ generalCategory = ModuleRecipe
+ "GHC.Internal.Unicode.Char.UnicodeData.GeneralCategory"
+ genGeneralCategoryModule
+
+ simpleUpperCaseMapping = ModuleRecipe
+ "GHC.Internal.Unicode.Char.UnicodeData.SimpleUpperCaseMapping"
+ (\m -> genSimpleCaseMappingModule m "toSimpleUpperCase" UD.simpleUpperCaseMapping)
+
+ simpleLowerCaseMapping = ModuleRecipe
+ "GHC.Internal.Unicode.Char.UnicodeData.SimpleLowerCaseMapping"
+ (\m -> genSimpleCaseMappingModule m "toSimpleLowerCase" UD.simpleLowerCaseMapping)
+
+ simpleTitleCaseMapping = ModuleRecipe
+ "GHC.Internal.Unicode.Char.UnicodeData.SimpleTitleCaseMapping"
+ (\m -> genSimpleCaseMappingModule m "toSimpleTitleCase" UD.simpleTitleCaseMapping)
=====================================
libraries/ghc-internal/tools/ucd2haskell/ucd2haskell.cabal
=====================================
@@ -1,6 +1,6 @@
cabal-version: 2.2
name: ucd2haskell
-version: 0.3.0
+version: 0.4.0
synopsis: Converter from Unicode character database to Haskell.
description:
The Haskell data structures are generated programmatically from the
@@ -10,12 +10,12 @@ description:
license: BSD-3-Clause
license-file: LICENSE
author: Composewell Technologies and Contributors
-maintainer: streamly at composewell.com
-copyright: 2020 Composewell Technologies and Contributors
+maintainer: The GHC Developers <ghc-devs at haskell.org>
+bug-reports: https://gitlab.haskell.org/ghc/ghc/-/issues/new
category: Data,Text,Unicode
stability: Experimental
build-type: Simple
-tested-with: GHC==9.2.2
+tested-with: GHC==9.8.2
extra-source-files:
README.md
@@ -23,18 +23,14 @@ extra-source-files:
common default-extensions
default-extensions:
- BangPatterns
DeriveGeneric
- MagicHash
- RecordWildCards
- ScopedTypeVariables
- TupleSections
- FlexibleContexts
-
- -- Experimental, may lead to issues
DeriveAnyClass
- TemplateHaskell
- UnboxedTuples
+ ExistentialQuantification
+ LambdaCase
+ OverloadedStrings
+ OverloadedRecordDot
+ ScopedTypeVariables
+ RecordWildCards
common compile-options
ghc-options: -Wall
@@ -42,21 +38,20 @@ common compile-options
-fwarn-incomplete-record-updates
-fwarn-incomplete-uni-patterns
-fwarn-tabs
- default-language: Haskell2010
+ default-language: GHC2021
executable ucd2haskell
import: default-extensions, compile-options
- default-language: Haskell2010
ghc-options: -O2
hs-source-dirs: exe
main-is: UCD2Haskell.hs
- other-modules: Parser.Text
+ other-modules: UCD2Haskell.ModuleGenerators
build-depends:
- base >= 4.7 && < 4.20
- , streamly-core >= 0.2.2 && < 0.3
- , streamly >= 0.10 && < 0.11
- , split >= 0.2.3 && < 0.3
- , getopt-generics >= 0.13 && < 0.14
- , containers >= 0.5 && < 0.7
- , directory >= 1.3.6 && < 1.3.8
- , filepath >= 1.4.2 && < 1.5
+ base >= 4.7 && < 5
+ , bytestring >= 0.11 && < 0.13
+ , containers >= 0.5 && < 0.7
+ , directory >= 1.3.6 && < 1.3.8
+ , filepath >= 1.4.2 && < 1.5
+ , getopt-generics >= 0.13 && < 0.14
+ , split >= 0.2.3 && < 0.3
+ , unicode-data-parser >= 0.2.0 && < 0.4
View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/commit/7b23ce8b48a169ba15d5ab585505f43d71363d38
--
View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/commit/7b23ce8b48a169ba15d5ab585505f43d71363d38
You're receiving this email because of your account on gitlab.haskell.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-commits/attachments/20240613/232635f3/attachment-0001.html>
More information about the ghc-commits
mailing list