W3C

Requirements for Indian Languages in CSS technology

ABNF Valid Segmentation Version 1.0

Abstract

This document describes the requirements for laying out Indian languages with CSS technology. It presents some of the major CSS issues for Indian languages (Hindi, Malayalam, Odia and Punjabi). It also describes the definition of ABNF Valid Segmentation, which has been developed from this study to overcome CSS limitations in the styling of the first-letter pseudo-element, the vertical and horizontal arrangement of characters, Unicode text segmentation, and Unicode line breaking in Indic languages.


Table of Contents

Objective
1. Introduction
1.1 What is CSS
2. Basic Composition of Indic Languages
2.1 Hindi
2.2 Bengali
2.3 Gujarati
2.4 Kannada
2.5 Tamil
2.6 Malayalam
2.7 Marathi
2.8 Telugu
3. Issues
3.1 Styling of first letter pseudo-element
3.1.1 Examples for Hindi language
3.1.2 Bengali
3.1.3 Malayalam
3.1.4 Gujarati
3.2 Vertical arrangements of characters
3.3 Horizontal spacing
3.4 Unicode Text Segmentation UAX #29
3.5 Unicode Line Breaking Algorithm UAX #14
4. Summary of CSS issues in Indic languages
5. Proposed Solution for CSS Issues in Indic Languages
5.1 Needs of ABNF Valid Segmentation
6. Comparisons of ABNF Valid Segmentation definition in Hindi, Odia, Malayalam, Punjabi
7. Future Action
8. Contributors
9. References
10. Annexure I. CSS 3 properties for ABNF Applicability

Objective

The main objective of this document is to cover the definition of ABNF Valid Segmentation for all 22 official languages of India, with examples, and to arrive at a solution to the text segmentation issues in CSS.

1. Introduction

1.1 What is CSS

CSS stands for Cascading Style Sheets. A style sheet holds a collection of rules that control how web pages are presented. CSS can be applied to web pages in several ways; the most powerful is an external style sheet, which controls the design and appearance of a whole site from a single location and makes it easy to update the site globally.

Each cultural community has its own language, script and writing system. Transferring each writing system into cyberspace is therefore a task of very high importance for information and communication technology. This document describes text composition issues and layout requirements for eight Indian languages: Hindi, Bengali, Kannada, Gujarati, Marathi, Malayalam, Tamil and Telugu.

2. Basic Composition of Indic Languages

2.1 Hindi

2.2 Bengali

2.3 Gujarati

2.4 Kannada

2.5 Tamil

2.6 Malayalam

2.7 Marathi

2.8 Telugu

3. Issues

3.1 Styling of first letter pseudo-element

The first-letter pseudo-element represents the first letter of the first line of a block, if it is not preceded by any other content (such as images or inline tables) on its line. It allows that first letter to be styled individually, without markup. It may be used for "initial caps" and "drop caps", which are common typographical effects in text in Latin script. Drop initial is a typographic effect emphasizing the initial letter(s) of a block element with a presentation similar to a 'floated' element. The drop initial effect may also be used for writing systems which use different alignment strategies. For example, in Devanagari the hanging baseline may be preferred. In that case the primary connection point connects the text-after-edge of the initial letter with the text-after-edge of the nth line, but the secondary connection point connects the hanging baselines of the initial letter and the initial line.

3.1.1 Examples for Hindi language

If a styling feature is to be applied to the starting character, the question is whether it should apply to a single character, a conjunct, a syllable or a grapheme cluster.

Indic script behavior relates to syllables, rather than individual letter forms. In the Hindi word स्थिति ('sthiti') the sequence of characters in the first syllable is stored in memory as follows:

0938: स DEVANAGARI LETTER SA
094D: ् DEVANAGARI SIGN VIRAMA
0925: थ DEVANAGARI LETTER THA
093F: ि DEVANAGARI VOWEL SIGN I

Note how the vowel sign appears to the left of the first character, not the third. There are two default grapheme clusters here. The first is SA+VIRAMA+THA+I. (The second is the last two characters, TA+I.)

From the feedback we have received, it appears that first-letter styling will be needed for Indic scripts. We have examples in the mail archive of such styling in Devanagari, Bengali and Malayalam, and we have reports that it is needed for other scripts, such as Telugu, Tamil and Kannada. The styling is done on the basis of the syllable, not the first character. A syllable includes a base consonant and any combination of the following characters in the text stream:

• Consonants preceded by virama (i.e. conjuncts).
• Vowel signs.
• Visarga, anusvara or candrabindu.
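
The syllable rule above can be sketched in Python as follows. This is a minimal sketch, not a complete Devanagari syllable parser: the function name first_syllable and the character ranges (consonants U+0915 to U+0939, vowel signs U+093E to U+094C, modifiers U+0901 to U+0903) are simplifying assumptions made for illustration, and nukta and joiner characters are ignored.

```python
import re
import unicodedata

# Simplified Devanagari classes (assumed ranges, for illustration only).
CONSONANT = "[\u0915-\u0939]"
VIRAMA = "\u094D"
VOWEL_SIGN = "[\u093E-\u094C]"
MODIFIER = "[\u0901-\u0903]"  # candrabindu, anusvara, visarga

# A base consonant followed by any combination of:
# virama + consonant (conjuncts), vowel signs, and modifiers.
SYLLABLE = re.compile(
    f"{CONSONANT}(?:{VIRAMA}{CONSONANT}|{VOWEL_SIGN}|{MODIFIER})*"
)

def first_syllable(text):
    """Return the leading syllable of text, or its first character as a fallback."""
    match = SYLLABLE.match(text)
    return match.group(0) if match else text[:1]

word = "\u0938\u094D\u0925\u093F\u0924\u093F"  # the word 'sthiti' discussed above
for ch in first_syllable(word):
    print(f"{ord(ch):04X}: {unicodedata.name(ch)}")
```

For the word 'sthiti' the loop prints the four code points SA, VIRAMA, THA and VOWEL SIGN I, which is the unit a first-letter style would need to cover.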

3.1.2 Bengali

3.1.3 Malayalam

3.1.4 Gujarati

3.2 Vertical arrangements of characters

Presentation and styling issue: when a string is written in vertical mode, placing each character on a new line may not be suitable. An example is the vertical arrangement of characters in Hindi.


When this issue was first discussed, questions were raised about whether Indic scripts (Devanagari etc.) are written in this fashion and whether it would be of use anywhere.

3.3 Horizontal spacing

The same applies to horizontal spacing in Indic languages. In English, horizontal spacing is applied between every character, as in C E R T I F I C A T E. In Indian languages such as Bangla and Assamese, however, the space may need to be applied not after every character but after a portion of the character sequence, as in the figure below:
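
The difference between per-character and per-cluster spacing can also be sketched in code. The following Python sketch is an approximation under stated assumptions, not the ABNF definition itself: it groups each base character with the combining marks that follow it, and keeps a letter attached when the preceding character has canonical combining class 9 (the class of the Indic viramas). The function names clusters and space_out are hypothetical, and joiner characters are not handled.

```python
import unicodedata

VIRAMA_CLASS = 9  # canonical combining class of the Indic virama characters

def clusters(text):
    """Group base characters with following marks and virama-joined letters."""
    result = []
    for ch in text:
        category = unicodedata.category(ch)
        joins_previous = False
        if result:
            previous = result[-1][-1]
            after_virama = unicodedata.combining(previous) == VIRAMA_CLASS
            joins_previous = category.startswith("M") or (
                after_virama and category.startswith("L")
            )
        if joins_previous:
            result[-1] += ch
        else:
            result.append(ch)
    return result

def space_out(text, gap=" "):
    """Insert gap between clusters rather than between code points."""
    return gap.join(clusters(text))

print(space_out("CERTIFICATE"))            # C E R T I F I C A T E
bangla = "\u09AC\u09BE\u0982\u09B2\u09BE"  # the word 'Bangla' in Bengali script
print(space_out(bangla))                   # two clusters, not five code points
```

Spacing by code point would separate the vowel sign and the anusvara from their base consonant, which is the behavior this section identifies as unsuitable.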

3.4 Unicode Text Segmentation UAX #29

Word Boundaries (Hyphenation): Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection, or “move to next word” control-arrow keys), and “Whole Word Search” for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another.

Recommended solution: ABNF Valid Segmentation and hyphenation dictionary (if available)

Sentence Boundaries
Recommended solution: treat some special marks as sentence boundaries, such as the double poorna virama, possibly with numbers (as in Sanskrit text, shlokas etc.).

A string of Unicode-encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. The precise determination of text elements may vary according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E FULL STOP) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user.

Solution

Grapheme cluster boundaries: based on ABNF Valid Segmentation, with possible extensions for handling some cases (?)
Deletion and backspace: code point by code point as well as by ABNF Valid Segmentation
Mouse selection: at the ABNF Valid Segmentation level and at the code point level

Example of double-click mouse selection:

Internet Explorer: [screenshot wrdbrk_ie]
Google Chrome: [screenshot wrdbrk_crom]
Mozilla Firefox: [screenshot wrdbrk_moz]

3.5 Unicode Line Breaking Algorithm UAX #14 (Word wrapping)

Solution
Characters not starting a line: a line should not begin with any of the characters shown below:
• closing brackets (cl-02)
• hyphens (cl-03)
• dividing punctuation marks (cl-04)
• middle dots (cl-05)
• full stops (cl-06)
• commas (cl-07)
• iteration marks (cl-09)

Line Breaking for Doha

• Poorna viram । U+0964
• Double Poorna viram ॥ U+0965
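
The Doha rule above, breaking a line after a poorna viram or double poorna viram, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name break_doha is hypothetical, the sample string is placeholder text rather than a real doha, and verse numbers enclosed in dandas are not treated specially.

```python
import re

DANDA = "\u0964"         # poorna viram
DOUBLE_DANDA = "\u0965"  # double poorna viram
MARKS = DANDA + DOUBLE_DANDA

# Either a run of text ending in a danda, or trailing text with no danda.
LINE = re.compile(f"[^{MARKS}]*[{MARKS}]|[^{MARKS}]+")

def break_doha(text):
    """Split text into lines, breaking after each poorna viram or double poorna viram."""
    return [part.strip() for part in LINE.findall(text) if part.strip()]

sample = f"first half {DANDA} second half {DOUBLE_DANDA}"  # placeholder text
for line in break_doha(sample):
    print(line)
```

Each returned line ends with the mark that closed it, so the danda never starts the following line.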

'word-wrap': This CSS 3 property specifies whether the current rendered line should break if the content exceeds the boundary of the specified rendering box for an element.

Internet Explorer: [screenshot wrdwrp_ie]
Google Chrome: [screenshot wrdwrp_crom]
Mozilla Firefox: [screenshot wrdwrp_moz]

-------------------------------------------------------------------------------------------------------------------------------------
'word-break': This CSS 3 property specifies line-break opportunities within words.

Internet Explorer: [screenshot wrdbrk_ie]
Google Chrome: [screenshot wrdbrk_crom]
Mozilla Firefox: [screenshot wrdbrk_moz]

4. Summary of CSS issues in Indic languages

S. No. 1
Styling issue: First character and drop initial overview
Description: Issues for Indian languages with respect to the first character, as used in Hindi, Malayalam, Bengali, Tamil, Punjabi etc., are shown in the examples in section 3.1. Drop initial is a typographic effect emphasizing the initial letter(s) of a block element with a presentation similar to a 'floated' element. Examples are available in Hindi, Bengali, Malayalam, Marathi, Gujarati and Punjabi. For details see section 3.1.
Suggestions: It appears that first-letter styling will be needed for Indic scripts. Examples of such styling in Devanagari, Bengali, Malayalam, Tamil and Gurumukhi scripts are available, and it is also needed for other scripts, such as Telugu and Kannada. In the W3C standards, drop initial is available for Hindi and English. However, it varies across Indian languages such as Marathi, Bengali, Malayalam, Punjabi and Gujarati. Therefore some changes may need to be made to the CSS standards developed by W3C with respect to Indian languages.

S. No. 2
Styling issue: Vertical arrangements of characters
Description: If a string is written in vertical mode, writing each character on a new line may not be suitable. An example is shown in section 3.2.
Suggestions: Some changes may need to be made to the CSS standards developed by W3C to specify how styles are applied vertically in Indian languages.

S. No. 3
Styling issue: Horizontal spacing
Description: How to represent horizontal spacing in Indian languages. An example is shown in section 3.3.
Suggestions: Some changes may need to be made to the CSS standards developed by W3C to specify how styles are applied horizontally in Indian languages.

S. No. 4
Styling issue: Unicode Text Segmentation UAX #29
Description: Word boundaries (hyphenation) and sentence boundaries. A string of Unicode-encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. An example is shown in section 3.4.
Suggestions: Some changes that may be required for text segmentation are:
  • Grapheme cluster boundaries: based on ABNF Valid Segmentation, with possible extensions for handling some cases (?)
  • Deletion and backspace: code point by code point as well as by ABNF Valid Segmentation
  • Mouse selection: at the ABNF Valid Segmentation level and at the code point level

S. No. 5
Styling issue: Unicode Line Breaking Algorithm UAX #14
Description: Word wrapping and word break. An example is shown in section 3.5.
Suggestions: A line should not begin with the characters shown below:
  • closing brackets (cl-02)
  • hyphens (cl-03)
  • dividing punctuation marks (cl-04)
  • middle dots (cl-05)
  • full stops (cl-06)
  • commas (cl-07)
  • iteration marks (cl-09)
Line breaking for Doha:
  • Poorna viram । U+0964
  • Double Poorna viram ॥ U+0965
For the future:
  • Verse number: ।१। , ॥ १॥

For more details view this document: http://w3cindia.in/cssdocument.html

5. Proposed Solution for CSS Issues in Indic Languages

5.1 Needs of ABNF Valid Segmentation

ABNF Valid Segmentation is intended to solve the following CSS issues, which are found in languages such as Hindi, Bangla, Punjabi, Malayalam, Tamil, Odia, Gujarati and Marathi:

• Styling of first letter pseudo-element
• Vertical and horizontal arrangement of characters
• Horizontal spacing
• Unicode Text Segmentation UAX #29
• Unicode Line Breaking Algorithm UAX #14

6. Comparisons of ABNF Valid Segmentation definition in Hindi, Odia, Malayalam, Punjabi

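Definition: Vm?|(CN?H?)*CN?(H|v?m?)

Here ? marks an optional element, * marks zero or more repetitions, | separates alternatives, and parentheses group elements.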

V (upper case) is any independent vowel
m is any vowel modifier (Devanāgari Anusvāra, Visarga, Candrabindu)
C is any consonant (with inherent vowel)
N is Nukta
H is halant or Virāma
v (lower case) is any dependent vowel sign (mātrā)
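
The definition above maps closely onto a regular expression, as the following Python sketch for Devanagari shows. It is a minimal sketch under stated assumptions, not a normative implementation: the function name segments is hypothetical, the character ranges are simplified, and the halant inside the repeated group is treated as required, i.e. (CN?H)*, which departs from the H? written in the definition. With a literal H?, a greedy matcher would merge a plain consonant run such as ka + ma + la into a single segment.

```python
import re

# Simplified Devanagari classes (assumed ranges, for illustration only).
V = "[\u0904-\u0914]"               # independent vowels
C = "[\u0915-\u0939\u0958-\u095F]"  # consonants
N = "\u093C"                        # nukta
H = "\u094D"                        # halant (virama)
v = "[\u093E-\u094C]"               # dependent vowel signs (matras)
m = "[\u0901-\u0903]"               # candrabindu, anusvara, visarga

# Vm? | (CN?H)* CN? (H | v?m?)
# Assumption: the halant in the repeated group is required here,
# whereas the definition writes it as optional (H?).
SEGMENT = re.compile(f"{V}{m}?|(?:{C}{N}?{H})*{C}{N}?(?:{H}|{v}?{m}?)")

def segments(text):
    """Return the valid segments found in text, left to right."""
    return SEGMENT.findall(text)

sthiti = "\u0938\u094D\u0925\u093F\u0924\u093F"  # the word 'sthiti' from section 3.1.1
for segment in segments(sthiti):
    print(" ".join(f"{ord(ch):04X}" for ch in segment))
```

For 'sthiti' this yields two segments, SA+VIRAMA+THA+I and TA+I, matching the syllable analysis in section 3.1.1. Characters outside the assumed classes are skipped by findall, so a production segmenter would need a fallback rule for them.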

7. Future Action

1. Cover the definition of ABNF Valid Segmentation for the remaining Indian languages.

2. Standardize ABNF Valid Segmentation once its definition has been covered for all Indic languages.

3. After the definition has been finalized for all the Indic languages, send these inputs to W3C for review.

8. Contributors

S. No.  Name                   Organization
1       Swaran Lata            W3C India
2       Gautam Sengupta        University of Hyderabad
3       Rajeev Sangal          IIT Hyderabad
4       Dipti Misra Sharma     IIT Hyderabad
5       Anil Kumar Singh       IIT Hyderabad
6       R K Sharma             Thapar University
7       Rajat Mohanty          IIT Bombay
8       Venkatesh Choppella    IIT Hyderabad
9       Soma Paul              IIT Hyderabad
10      Panchanan Mohanty      University of Hyderabad
11      G. Uma Maheshwar Rao   University of Hyderabad
12      Somanth Chandra        W3C India
13      Prashant Verma         W3C India
14      Prashant Tyagi         W3C India
15      Naitik Tyagi           W3C India

9. References

1. first-letter in non-Latin scripts. URL: http://www.w3.org/International/notes/firstletter.html

2. Michel Suignard (Microsoft), Eric A. Meyer. URL: http://www.w3.org/TR/2002/WD-css3-linebox-20020515/

3. http://www.w3.org/blog/International/2006/01/20/request_for_feedback_usefulness_of_first

4. http://osdir.com/ml/web.css.general/2006-05/msg00010.html

5. http://osdir.com/ml/web.css.general/2006-05/msg00011.html

6. Images of Indian Languages :

6.1  http://www.vaarttha.com/pages/main/MAIN-9.pdf

6.2   http://www.gujarattimes.com/Registernew.aspx

7. http://www.kolkatacdac.in/

8. http://www.cdacnoida.in/

9. http://unicode.org/reports/tr29/

10. http://unicode.org/reports/tr14/

11. http://www.jepa.or.jp/press_release/reqEPUBJ.html



Annexure I. CSS 3 properties for ABNF Applicability

Name: ':first-letter' (CSS selector)
Values: The following properties apply to the "first-letter" pseudo-element:

  • font properties
  • margin properties
  • vertical-align (only if "float" is "none")
  • text-transform
  • line-height

Name: 'letter-spacing'
Values: normal | <length> | inherit

Name: 'word-break'
Values: normal | keep-all | break-all

Name: 'word-wrap'
Values: normal | break-word