This document describes the requirements for laying out text in Indian languages with CSS. It identifies the major CSS issues for Indian languages (Hindi, Malayalam, Odia and Punjabi) and defines ABNF Valid Segmentation to overcome limitations in the following areas: styling of the first-letter pseudo-element, vertical and horizontal arrangement of characters, Unicode text segmentation, and Unicode line breaking in Indic languages. The definition of ABNF Valid Segmentation was developed from this study to address these issues.
Objective
1. Introduction
1.1 What is CSS
2. Basic Composition of Indic Languages
2.1 Hindi
2.2 Bengali
2.3 Gujarati
2.4 Kannada
2.5 Tamil
2.6 Malayalam
2.7 Marathi
2.8 Telugu
3. Issues
3.1 Styling of first letter pseudo-element
3.1.1 Examples for Hindi language
3.1.2 Bengali
3.1.3 Malayalam
3.1.4 Gujarati
3.2 Vertical arrangements of characters
3.3 Horizontal spacing
3.4 Unicode Text Segmentation UAX #29
3.5 Unicode Line Breaking Algorithm UAX #14
4. Summary of CSS issues in Indic languages
5. Proposed Solution for CSS Issues in Indic Languages
5.1 Needs of ABNF Valid Segmentation
6. Comparisons of ABNF Valid Segmentation definition in Hindi, Odia, Malayalam, Punjabi
7. Future Action
8. Contributors
9. References
10. Annexure I. CSS 3 properties for ABNF Applicability
The main objective of this document is to define ABNF Valid Segmentation for all 22 official languages of India, with examples, and thereby to resolve text segmentation issues in CSS.
CSS stands for Cascading Style Sheets. A style sheet holds a collection of rules that control how web pages are presented. CSS can be applied to web pages in several ways, but the most powerful is an external style sheet: the design and appearance of an entire site are then controlled from a single location, which makes site-wide updates easy.
Each cultural community has its own language, script and writing system. Transferring each writing system into cyberspace is therefore a task of very high importance for information and communication technology. This document describes text composition issues and layout requirements for eight Indian languages: Hindi, Bengali, Kannada, Gujarati, Marathi, Malayalam, Tamil and Telugu.
The first-letter pseudo-element represents the first letter of the first line of a block, if it is not preceded by any other content (such as images or inline tables) on its line. It allows that first letter to be styled individually, without markup. It may be used for "initial caps" and "drop caps", which are common typographical effects in text in Latin script. Drop initial is a typographic effect emphasizing the initial letter(s) of a block element with a presentation similar to a 'floated' element. The drop initial effect may also be used for writing systems which use different alignment strategies. For example, in Devanagari the hanging baseline may be preferred. In that case the primary connection point connects the text-after-edge of the initial letter with the text-after-edge of the nth line, but the secondary connection point connects the hanging baselines of the initial letter and the initial line.
If a styling feature is to be applied to the starting character, the question is whether it applies to a single character, a conjunct character, a syllable or a grapheme cluster.
Indic script behavior relates to syllables, rather than individual letter forms. In the Hindi word स्थिति ('sthiti') the sequence of characters in the first syllable is as follows in memory:
0938: स DEVANAGARI LETTER SA
094D: ् DEVANAGARI SIGN VIRAMA
0925: थ DEVANAGARI LETTER THA
093F: ि DEVANAGARI VOWEL SIGN I
Note how the vowel sign appears to the left of the first character, not the third.
There are two default grapheme clusters here. The first includes SA+VIRAMA+THA+I. (The second is the last two characters of the word, TA+I.)
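The stored order described above can be checked programmatically. The following is a minimal sketch that uses Python's standard `unicodedata` module to list the code points of the word in memory order; the helper name `dump_code_points` is illustrative only and is not taken from any specification.

```python
import unicodedata

def dump_code_points(text):
    """Return (hex code point, character name) pairs in logical (memory) order."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# The Hindi word 'sthiti', written with escapes so the stored order is explicit.
word = "\u0938\u094D\u0925\u093F\u0924\u093F"

for code, name in dump_code_points(word):
    print(code, name)
```

The vowel sign I is listed after THA, although it is rendered to the left of the whole conjunct.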
From the feedback we have received it appears that first-letter styling will be needed for Indic scripts. We have examples in the mail archive for such styling in Devanagari, Bengali, and Malayalam, though we have reports that it is needed for other scripts, such as Telugu, Tamil and Kannada.
We see that the styling is done on the basis of the syllable, not the first character. A syllable includes a base consonant and any combination of the following characters in the text stream:
• Consonants preceded by virama (i.e. conjuncts).
• Vowel signs.
• Visarga, anusvara or candrabindu.
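The syllable rule above can be sketched as a regular expression. This is only an illustration for Devanagari: the code point ranges and the names `SYLLABLE` and `first_syllable` are assumptions of this sketch, nukta and independent vowels are not handled, and it is not the ABNF Valid Segmentation definition itself.

```python
import re

# Devanagari code point classes (ranges assumed for this sketch).
CONSONANT = "\u0915-\u0939"
VIRAMA = "\u094D"
VOWEL_SIGN = "\u093E-\u094C"
MODIFIER = "\u0901-\u0903"  # candrabindu, anusvara, visarga

# A base consonant followed by any mix of virama+consonant, vowel signs and modifiers.
SYLLABLE = re.compile(
    f"[{CONSONANT}](?:{VIRAMA}[{CONSONANT}]|[{VOWEL_SIGN}]|[{MODIFIER}])*"
)

def first_syllable(text):
    """Return the leading syllable, or the first code point if the rule does not match."""
    match = SYLLABLE.match(text)
    return match.group(0) if match else text[:1]

# For the word 'sthiti' the result is SA+VIRAMA+THA+I, not just the letter SA.
print(first_syllable("\u0938\u094D\u0925\u093F\u0924\u093F"))
```

A first-letter style applied to this unit would cover the whole conjunct and its vowel sign, which matches the styling seen in the collected examples.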
Presentation and styling issues: vertical arrangement of characters. If a string is written in vertical mode, placing each character on a new line may not be suitable, for example when Hindi text is arranged vertically.
When this issue was first discussed, questions were raised as to whether Indic scripts (Devanagari etc.) are written in this fashion and whether such an arrangement would be of use anywhere.
The same applies to horizontal spacing in Indic languages. In English, letter spacing inserts a space between every character, as in C E R T I F I C A T E. In Indian languages such as Bangla and Assamese, however, the space may be inserted not after every character but after a group of characters in the sequence, as the figure for this section illustrates.
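The difference between spacing every character and spacing every cluster can be sketched as follows. The clustering rule used here (combining marks stay with their base, and a letter that follows a virama joins the preceding unit) is a simplifying assumption of this sketch, not the ABNF Valid Segmentation definition; the function names `clusters` and `letter_space` are hypothetical.

```python
import unicodedata

VIRAMA_CLASS = 9  # canonical combining class that Unicode assigns to virama signs

def clusters(text):
    """Split text into spacing units: a base letter plus its marks and conjuncts."""
    units = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_conjunct = (
            bool(units)
            and unicodedata.combining(units[-1][-1]) == VIRAMA_CLASS
            and unicodedata.category(ch).startswith("L")
        )
        if units and (is_mark or joins_conjunct):
            units[-1] += ch  # keep the mark or conjunct member with its base
        else:
            units.append(ch)  # start a new spacing unit
    return units

def letter_space(text, gap=" "):
    """Insert the gap between spacing units rather than between code points."""
    return gap.join(clusters(text))

print(letter_space("CERTIFICATE"))
print(letter_space("\u0938\u094D\u0925\u093F\u0924\u093F"))
```

For Latin text every letter is its own unit, so the result is the familiar spaced-out word. For the Devanagari word the gap falls only between the two syllables. The same units could also be stacked one per line for a vertical arrangement.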
Word Boundaries (Hyphenation) : Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection, or “move to next word” control-arrow keys), and “Whole Word Search” for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another.
Recommended solution: ABNF Valid Segmentation and hyphenation dictionary (if available)
Sentence Boundaries
Recommended solution: Treat special marks such as the double poorna virama (॥, U+0965), possibly followed by a number (as in Sanskrit texts, shlokas etc.), as sentence boundaries.
A string of Unicode-encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. The precise determination of text elements may vary according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E FULL STOP) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user.
Solution
Grapheme cluster boundaries: based on ABNF Valid Segmentation, with a possible extension for handling some cases (?)
Deletion and backspace: by code point as well as by ABNF Valid Segmentation
Mouse selection: at both the ABNF Valid Segmentation level and the code point level
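The two deletion granularities can be contrasted with a small sketch. The segment rule below (trailing marks and virama-joined consonants are removed together with their base) is an assumed approximation standing in for ABNF Valid Segmentation, and both function names are hypothetical.

```python
import unicodedata

def backspace_code_point(text):
    """Delete only the last code point."""
    return text[:-1]

def backspace_segment(text):
    """Delete the whole trailing segment: a base letter with its marks and conjuncts."""
    if not text:
        return text
    i = len(text) - 1
    while i > 0:
        ch, prev = text[i], text[i - 1]
        is_mark = unicodedata.category(ch).startswith("M")
        # Canonical combining class 9 identifies a virama before this letter.
        after_virama = (
            unicodedata.combining(prev) == 9
            and unicodedata.category(ch).startswith("L")
        )
        if not (is_mark or after_virama):
            break  # text[i] starts the trailing segment
        i -= 1
    return text[:i]

word = "\u0938\u094D\u0925\u093F\u0924\u093F"  # 'sthiti'
print(backspace_code_point(word))  # removes only the final vowel sign
print(backspace_segment(word))     # removes TA together with its vowel sign
```

An editor could offer both behaviours, for example code point deletion for backspace and segment deletion for the delete key; that mapping is a design choice and is not prescribed here.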
Example of double click mouse selection :
Solution
Characters not starting a line: a line should not begin with any of the characters listed below:
• closing brackets (cl-02),
• hyphens (cl-03),
• dividing punctuation marks (cl-04),
• middle dots (cl-05),
• full stops (cl-06),
• commas (cl-07),
• iteration marks (cl-09).
Line Breaking for Doha
• Poorna viram । U+0964
• Double Poorna viram ॥ U+0965
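The two rules above can be sketched in a few lines. The character set `NO_LINE_START` is a small assumed sample and not the full set of classes listed above. The reading that a doha breaks after each poorna viram or double poorna viram is also an assumption of this sketch, and both function names are hypothetical.

```python
import re

# Sample of characters that should not start a line (assumed, not exhaustive):
# poorna viram, double poorna viram, closing brackets, comma, full stop, hyphen.
NO_LINE_START = set("\u0964\u0965)]},.-")

def pull_up_leading_marks(lines):
    """Move forbidden line-start characters to the end of the previous line.

    Lines that become empty are dropped in this sketch.
    """
    fixed = []
    for line in lines:
        line = line.lstrip()
        while fixed and line and line[0] in NO_LINE_START:
            fixed[-1] += line[0]
            line = line[1:].lstrip()
        if line:
            fixed.append(line)
    return fixed

def break_doha(text):
    """Split a verse after each poorna viram or double poorna viram."""
    parts = re.findall(r"[^\u0964\u0965]+[\u0964\u0965]?", text)
    return [part.strip() for part in parts if part.strip()]

print(pull_up_leading_marks(["abc", "\u0964 def"]))
print(break_doha("first half \u0964 second half \u0965"))
```

A full implementation would apply these constraints while lines are being measured, as UAX #14 does, rather than repairing lines after the fact.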
| S. No. | Styling issue | Description | Suggestions |
|---|---|---|---|
| 1 | First character and drop initial overview | Issues with the first character in Indian languages such as Hindi, Malayalam, Bengali, Tamil and Punjabi, as shown in the examples in section 3.1. | First-letter styling appears to be needed for Indic scripts. Examples of such styling are available for the Devanagari, Bengali, Malayalam, Tamil and Gurumukhi scripts, and it is also needed for other scripts, such as Telugu and Kannada. |
| 2 | Vertical arrangements of characters | If a string is written in vertical mode, placing each character on a new line may not be suitable. See the example in section 3.2. | The CSS standards developed by W3C may need changes that specify how styles are applied vertically in Indian languages. |
| 3 | Horizontal spacing | How to represent horizontal spacing in Indian languages. See the example in section 3.3. | The CSS standards developed by W3C may need changes that specify how styles are applied horizontally in Indian languages. |
| 4 | Unicode Text Segmentation UAX #29 | Word boundaries (hyphenation) and sentence boundaries | Changes may be required for text segmentation in deletion and backspace, and in mouse selection. |
| 5 | Unicode Line Breaking Algorithm UAX #14 | Word wrapping, word break | A line should not begin with the characters listed in section 3.5. |
ABNF Valid Segmentation is used to solve the following CSS issues, which are found in languages such as Hindi, Bangla, Punjabi, Malayalam, Tamil, Odia, Gujarati and Marathi:
• Styling of first letter pseudo-element
• Vertical arrangement of characters
• Horizontal spacing
• Unicode Text Segmentation UAX #29
• Unicode Line Breaking Algorithm UAX #14
The definitions of ABNF Valid Segmentation use the following symbols:
V (upper case) is any independent vowel
m is any vowel modifier (Devanāgari Anusvāra, Visarga, Candrabindu)
C is any consonant (with inherent vowel)
N is Nukta
H is halant or Virāma
v (lower case) is any dependent vowel sign (mātrā)
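To make the symbols concrete, the following sketch maps Devanagari code points to the six symbols defined above. The code point ranges in `CLASSES` are assumptions of this sketch and cover only the core Devanagari letters; the function name `to_symbols` is hypothetical, and any other character is shown as `?`.

```python
# (first, last, symbol) ranges for Devanagari; assumed for this sketch.
CLASSES = [
    (0x0905, 0x0914, "V"),  # independent vowels
    (0x0901, 0x0903, "m"),  # candrabindu, anusvara, visarga
    (0x0915, 0x0939, "C"),  # consonants (with inherent vowel)
    (0x093C, 0x093C, "N"),  # nukta
    (0x094D, 0x094D, "H"),  # halant / virama
    (0x093E, 0x094C, "v"),  # dependent vowel signs (matras)
]

def to_symbols(text):
    """Translate each code point to its symbol, or '?' if it is unclassified."""
    out = []
    for ch in text:
        cp = ord(ch)
        out.append(next((sym for lo, hi, sym in CLASSES if lo <= cp <= hi), "?"))
    return "".join(out)

# The word 'sthiti' (SA, VIRAMA, THA, I, TA, I) becomes the symbol string CHCvCv.
print(to_symbols("\u0938\u094D\u0925\u093F\u0924\u093F"))
```

A segmentation rule written in terms of these symbols can then be tested against the symbol string instead of raw code points.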
1. Extend the definition of ABNF Valid Segmentation to the remaining Indian languages.
2. Standardize ABNF Valid Segmentation once its definition covers all Indic languages.
3. Once the definitions are finalized for all Indic languages, send these inputs to W3C for review.
| S. No. | Name | Organization |
|---|---|---|
| 1 | Swaran Lata | W3C India |
| 2 | Gautam Sengupta | University of Hyderabad |
| 3 | Rajeev Sangal | IIT Hyderabad |
| 4 | Dipti Misra Sharma | IIT Hyderabad |
| 5 | Anil Kumar Singh | IIT Hyderabad |
| 6 | R K Sharma | Thapar University |
| 7 | Rajat Mohanty | IIT Bombay |
| 8 | Venkatesh Choppella | IIT Hyderabad |
| 9 | Soma Paul | IIT Hyderabad |
| 10 | Panchanan Mohanty | University of Hyderabad |
| 11 | G. Uma Maheshwar Rao | University of Hyderabad |
| 12 | Somanth Chandra | W3C India |
| 13 | Prashant Verma | W3C India |
| 14 | Prashant Tyagi | W3C India |
| 15 | Naitik Tyagi | W3C India |
1. first-letter in non-Latin scripts. URL: http://www.w3.org/International/notes/firstletter.html
2. Michel Suignard (Microsoft), Eric A. Meyer. URL: http://www.w3.org/TR/2002/WD-css3-linebox-20020515/
3. http://www.w3.org/blog/International/2006/01/20/request_for_feedback_usefulness_of_first
4. http://osdir.com/ml/web.css.general/2006-05/msg00010.html
5. http://osdir.com/ml/web.css.general/2006-05/msg00011.html
6. Images of Indian Languages :
6.1 http://www.vaarttha.com/pages/main/MAIN-9.pdf
6.2 http://www.gujarattimes.com/Registernew.aspx
7. http://www.kolkatacdac.in/
8. http://www.cdacnoida.in/
9. http://unicode.org/reports/tr29/
10. http://unicode.org/reports/tr14/
11. http://www.jepa.or.jp/press_release/reqEPUBJ.html
| Name | Values |
|---|---|
| ':first-letter' (CSS selector) | The following properties apply to the "first-letter" pseudo-element: |
| 'letter-spacing' | normal \| <length> \| inherit |
| 'word-break' | normal \| keep-all \| break-all |
| 'word-wrap' | normal \| break-word |