Technicalities

If you understand scripts, fonts, Unicode and so on, and generally consider yourself computer-competent (a geek, a nerd or a guru), you can safely skip this page. But since you are here, I request you to read this document and comment on it.

Script and Language

There are many scripts in the world. http://en.wikipedia.org/wiki/ISO_15924 has the list of scripts standardized by ISO. Scripts and languages are in a many-to-many relationship. One script is often used to write several different languages, e.g.

· Latin Script (ISO:LATN) is used to write English, French, German etc.

· Arabic Script (ISO:ARAB) is used to write Arabic, Farsi, Urdu, Pashto etc.

· Devanagari Script (ISO:DEVA) is used to write Sanskrit, Hindi, Marathi, Konkani, Nepali etc.

There are languages which can be written in different scripts. E.g.

· Panjabi is written in Gurmukhi Script (ISO:GURU) in India, but in Arabic Script (ISO:ARAB) in Pakistan.

· Serbian is written in Latin Script (ISO:LATN) or Cyrillic (ISO:CYRL) depending upon the geographical area.

· English, which is generally written in Latin Script (ISO:LATN), is also written in Braille Script (ISO:BRAI).

One must consciously think about a script as a way of representing the language without speaking.
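The many-to-many relationship can be sketched as a small lookup table. This is only an illustration: the script codes follow the notation used on this page, the language lists are just the examples given above, and the names SCRIPT_TO_LANGUAGES and invert are made up for this sketch.

```python
# Script codes as written on this page; the language lists are only the
# examples mentioned in the text, not a complete inventory.
SCRIPT_TO_LANGUAGES = {
    "LATN": ["English", "French", "German", "Serbian"],
    "ARAB": ["Arabic", "Farsi", "Urdu", "Pashto", "Panjabi"],
    "DEVA": ["Sanskrit", "Hindi", "Marathi", "Konkani", "Nepali"],
    "GURU": ["Panjabi"],
    "CYRL": ["Serbian"],
    "BRAI": ["English"],
}


def invert(script_to_languages):
    """Build the reverse table: language -> sorted list of script codes."""
    language_to_scripts = {}
    for script, languages in script_to_languages.items():
        for language in languages:
            language_to_scripts.setdefault(language, []).append(script)
    return {lang: sorted(codes) for lang, codes in language_to_scripts.items()}


LANGUAGE_TO_SCRIPTS = invert(SCRIPT_TO_LANGUAGES)

print(LANGUAGE_TO_SCRIPTS["Panjabi"])  # ['ARAB', 'GURU']
print(LANGUAGE_TO_SCRIPTS["Serbian"])  # ['CYRL', 'LATN']
print(LANGUAGE_TO_SCRIPTS["Marathi"])  # ['DEVA']
```

Reading the table in one direction gives "one script, many languages"; reading it in the other gives "one language, many scripts".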

Script and Language: a short summary (translated from the Marathi)

The Marathi word for script is लिपी (lipi). Many languages can be written in one script. ISO (the International Organization for Standardization) has given every script a four-letter code. E.g.

· LATN = Latin script: English, French, German and other languages use this script.

· ARAB = Arabic script: Arabic, Farsi, Urdu and Pashto use this script.

· DEVA = Devanagari script: Sanskrit, Marathi, Hindi, Konkani and Nepali use this script.

Some languages use more than one script, e.g.

· Panjabi: uses the Gurmukhi script (ISO:GURU) in India and the Arabic script (ISO:ARAB) in Pakistan.

· Serbian: uses the Latin (ISO:LATN) or Cyrillic (ISO:CYRL) script, depending on the region of the country.

· English: generally uses the Latin script (ISO:LATN), but the Braille script (ISO:BRAI) is used for books for blind people.

It is important to remember that a script is a way of conveying a language to others without speaking it.

Unicode

From the early 1960s to the 1980s, computers were used almost exclusively by the scientific and engineering community for heavy calculations. The unit of storage in a computer is the byte, which consists of 8 bits. Each bit can be either ON or OFF, represented as 1 or 0. A set of 8 bits, i.e. one byte, has 2⁸ = 256 possible combinations of 1s and 0s, so it can represent 256 different numbers (0 to 255). This way of writing numbers is called the binary numeral system. For more information on it, please refer to the Wikipedia article http://en.wikipedia.org/wiki/Binary_numeral_system . The earliest computers had no display screens or printers; they used a series of lights (ON or OFF) to display the answer.
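The arithmetic is easy to try out. Here is a minimal sketch in Python; the helper names to_bits and from_bits are made up for this illustration.

```python
BITS_PER_BYTE = 8

# Each bit doubles the number of combinations: 2 x 2 x ... x 2, eight times.
print(2 ** BITS_PER_BYTE)  # 256


def to_bits(number):
    """Return the 8-bit ON/OFF pattern (as 1s and 0s) for a number 0..255."""
    if not 0 <= number <= 255:
        raise ValueError("a single byte holds only 0..255")
    return format(number, "08b")


def from_bits(bits):
    """Turn a string of 1s and 0s back into the number it stands for."""
    return int(bits, 2)


print(to_bits(65))            # 01000001
print(to_bits(255))           # 11111111
print(from_bits("01000001"))  # 65
```

Each 1 or 0 in the printed pattern corresponds to one of the lights on those early machines being ON or OFF.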

People were not happy with just numbers and blinking lights, and electric typewriters and printers were already available. Computer folks came up with the concept of representing characters with numbers. They designed the first encodings, and the encoding called ASCII was widely adopted. ASCII mapped characters to numbers, e.g. character ‘A’ was represented by number 65 and so on. Then they wrote computer instructions to print the shape of character ‘A’ whenever the binary number 65 came up. Strictly speaking, ASCII itself used only 7 bits and defined 128 codes (0 to 127), enough for the letters, digits and punctuation required by English. Later 8-bit extensions used all 256 values of a byte and added, among other things, some graphical-looking characters for printing boxes etc. Later still, display screens were developed and the printer technology was extended to these displays.
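You can see this character-to-number mapping from any Python prompt. A small sketch: Python's built-in ord and chr actually work with Unicode numbers, but for the characters shown here these are the same as the ASCII numbers.

```python
# Character -> number and number -> character.
print(ord("A"))  # 65
print(chr(66))   # B

# A whole message is just a list of numbers to the computer.
message = "HELLO"
codes = [ord(ch) for ch in message]
print(codes)  # [72, 69, 76, 76, 79]

# The printer (or screen) does the reverse: number -> shape of the character.
decoded = "".join(chr(code) for code in codes)
print(decoded)  # HELLO

# Every one of these numbers is below 128, i.e. within the ASCII range.
print(all(code < 128 for code in codes))  # True
```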

People in France, Russia and other countries started to use computers and found that they needed room for their own characters, so they started using some of the 256 values (mostly those from 128 to 255) to represent characters specific to French or Russian. Soon other languages joined the party and wanted their share of the available 256. This led to a situation where character code 195 represented the Thai character Ro Rua, the Greek Capital Letter Gamma, or the Latin Capital Letter A with Tilde, depending on which encoding was assumed, and one could not reliably tell what the text was unless one knew in advance which language one was trying to read.
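The confusion around code 195 can be reproduced today, because Python still ships the old single-byte encodings. A small sketch follows; I am assuming that the three encodings meant here are ISO 8859-1, ISO 8859-7 and TIS-620, which Python calls latin_1, iso8859_7 and tis_620.

```python
# One byte with the value 195, and nothing else.
raw = bytes([195])

# Python codec name -> human-readable description.
ENCODINGS = {
    "latin_1": "Western European (ISO 8859-1)",
    "iso8859_7": "Greek (ISO 8859-7)",
    "tis_620": "Thai (TIS-620)",
}

# The same byte, read three different ways.
DECODED = {name: raw.decode(name) for name in ENCODINGS}

for name, description in ENCODINGS.items():
    ch = DECODED[name]
    print(f"{description}: {ch} (U+{ord(ch):04X})")
```

The byte never changes; only the table used to read it does. The U+ numbers printed at the end of each line are the separate Unicode values of the three characters.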

So over the years people of the world came together and created the Unicode standard, in which every character got its own unique number. Of course they could not fit everything into 256 values, so they decided to use 2 bytes, i.e. 16 bits, together, giving a possible range of 65536 characters in which every script gets its own space. With Unicode, text in different scripts can coexist in one file, document or communication without being misinterpreted.

Since text stored with 2 bytes per character takes up twice the space, people did not adopt Unicode easily. Those who had data in only one language and never sent it internationally did not care, and kept on using the single-byte systems. In the last 10 years or so, the cost of data storage (hard disks) has come down and communication speed has increased, so a doubled file size does not matter anymore. UTF-8 is a popular format for storing and sharing Unicode text; it is a variable-length encoding in which plain English (ASCII) text still takes only one byte per character.
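The size difference is easy to measure. Here is a rough sketch comparing UTF-8 with the 2-byte form UTF-16; the helper name encoded_sizes is made up for this illustration, and I use the "utf-16-le" variant only so that no byte-order mark is added to the count.

```python
def encoded_sizes(text):
    """Return how many bytes the text needs in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le: no byte-order mark
    }


# English text: 9 characters.
print(encoded_sizes("knowledge"))  # {'utf-8': 9, 'utf-16': 18}

# The Marathi word for "Marathi": 5 Devanagari characters.
print(encoded_sizes("\u092e\u0930\u093e\u0920\u0940"))  # {'utf-8': 15, 'utf-16': 10}
```

So English text doubles in size with 2 bytes per character but not with UTF-8, while Devanagari text needs 3 bytes per character in UTF-8.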

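Unicode is not a linguistic standard; it standardizes the encoding of scripts, not languages. It was later acknowledged that 65536 characters were not enough for all the scripts of the world, so Unicode extended its range to 1,114,112 code points (U+0000 to U+10FFFF). A character may be stored in up to 4 bytes (32 bits), as in UTF-32, but the range in use is far smaller than the 4 billion values that 32 bits could hold.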
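For reference, the Unicode range in use today ends at U+10FFFF, which is 1,114,112 possible code points, and characters above 65535 do exist. A small sketch, using an emoji (U+1F600) purely as an example of such a character:

```python
import sys

# The highest Unicode code point Python (3.3 or later) can represent.
print(hex(sys.maxunicode))  # 0x10ffff
print(sys.maxunicode + 1)   # 1114112

# A character that does not fit in the original 16-bit range.
beyond_16_bits = "\U0001F600"
print(ord(beyond_16_bits) > 65535)  # True

# Bytes needed to store this one character in each format.
sizes = {
    "utf-8": len(beyond_16_bits.encode("utf-8")),
    "utf-16": len(beyond_16_bits.encode("utf-16-le")),
    "utf-32": len(beyond_16_bits.encode("utf-32-le")),
}
print(sizes)  # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

In UTF-16 such a character is stored as a pair of 2-byte units, which is why it takes 4 bytes there as well.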

The Unicode Consortium’s web site is http://www.unicode.org . Please visit it for a better understanding of Unicode than what I have summarized in the few paragraphs above.

Unicode: a short summary (translated from the Marathi)

In the very beginning, that is in the 1960s and 1970s, scientists and engineers used computers for large calculations. Everything in a computer is measured in bytes. 8 bits make 1 byte. A bit holds 0 or 1 (off or on). Read together, 8 such bits can hold 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = 256 different numbers. Computers of that era had no screen or printer, so answers were shown by switching various lights on and off. Naturally, it took a computer expert to look at these lights and work out what they meant.

People did not like this much, and electric typewriters were available. To send the bytes from the computer to a typewriter, people laid out which byte stands for which letter. The layout called ASCII became very popular. In ASCII, ‘A’ was given the number 65, ‘B’ the number 66, and so on, and whenever 65 came out of the computer, that special typewriter printed ‘A’. In this way ordinary people could read the answers produced by computers. Such typewriters were given the name printer. The first monitors were built using the same technique.

People in Russia, France and other countries also wanted to print their own letters, and there were only 256 slots, so different scripts began to read the same number as different letters. A bare number then no longer had any meaning. For example, confusion arose over whether 195 meant the Thai letter Ro Rua or the Greek capital letter Gamma.

The world's computer experts came together and created the Unicode standard. By using 2 bytes instead of 1, they created fixed slots for 65536 characters, and every distinct character in every script of the world was given its own place. More recently, since even the 65536 slots filled up, Unicode has begun using up to 4 bytes per character, which makes room for 1,114,112 code points.

Transliteration and Phonetics

Languages are generally written in the script traditionally used for them. For example, the traditional way to write the word “knowledge” is in Latin script, but one can write it as “नॉलेज” in Devanagari or “నోలేజ్” in Telugu. Transliteration is writing a language in a script that is not traditionally used for that language.
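A computer can tell that these three spellings use three different scripts even though they are the same word. The sketch below uses a simple heuristic of my own (taking the first word of each character's official Unicode name), not a proper script lookup; the function name scripts_used is made up for this illustration.

```python
import unicodedata


def scripts_used(text):
    """Guess the scripts in a text from the first word of each character's
    Unicode name, e.g. 'DEVANAGARI LETTER NA' -> 'DEVANAGARI'.
    This is a rough heuristic, good enough for the examples here."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if not ch.isspace()
    }


print(scripts_used("knowledge"))  # {'LATIN'}
print(scripts_used("नॉलेज"))       # {'DEVANAGARI'}
print(scripts_used("నోలేజ్"))      # {'TELUGU'}
```

The language in all three cases is English; only the script changes, which is exactly what transliteration means.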

Due to the widespread use of English in India and elsewhere, we have become accustomed to using the phonetic associations of Latin script characters to transliterate other languages into Latin script. This does not give the correct pronunciation; we have been compromising on phonetic accuracy in favor of being able to use existing technology, such as typewriters and printers, that did not support anything other than Latin script. In reality there is *no way* to accurately write many sounds in plain Latin script. Take my name, मकरंद गद्रे. I have always used the spelling “Makarand Gadre”, but this is certainly phonetically inaccurate: people who do not know Marathi (or some other Indic language) pronounce it as “मॅकॅरॅंड् गेडर्” or “मॅकॅरॅंड् गॅडर्”. Devanagari is close to being a perfect phonetic script, though it cannot represent some spoken sounds, such as the click sounds of the Bushman languages (see the movie “The Gods Must Be Crazy”) or the tones used in Chinese.

Fonts

Fonts are data files that define visual shapes of characters. Here is an example of the same text in different fonts.

The last two examples are interesting because both show Latin script text (i.e. English text) in practically unreadable form, using the fonts Wingdings and Shusha. In the pre-Unicode days, fonts were created that used the same idea as Wingdings and put different shapes in place of the Latin characters. Older Marathi / Devanagari fonts like Shusha and Shivaji did the same, i.e. they drew shapes that looked like Marathi characters at the codes that normally stand for English characters. This is called font-based encoding, which means that you cannot read the text unless you have the font that goes with it. If you do not have the font, the data will be displayed as gibberish text in Latin script.

If the data is saved in Unicode, the data retains its identity even if the font does not have a visual representation of the character. In such cases a rectangular box is displayed on the screen.
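The difference between font-based encoding and Unicode can be sketched as follows. Please note that the glyph table here is purely hypothetical: it is NOT the real layout of Shusha or any other font, and the names LEGACY_FONT_GLYPHS and render_with_legacy_font are made up for this illustration.

```python
import unicodedata

# HYPOTHETICAL glyph table of a font-based encoding: Latin letter -> the
# Devanagari shape the font draws for it. Not the real Shusha layout.
LEGACY_FONT_GLYPHS = {"m": "\u092e", "k": "\u0915", "r": "\u0930"}


def render_with_legacy_font(stored_text):
    """Show what a reader sees when the legacy font is installed."""
    return "".join(LEGACY_FONT_GLYPHS.get(ch, ch) for ch in stored_text)


stored_legacy = "mkr"                  # what is really saved in the file
stored_unicode = "\u092e\u0915\u0930"  # the same three letters in Unicode

# With the font the reader sees Devanagari; without it, just "mkr".
print(render_with_legacy_font(stored_legacy))
print(stored_legacy)

# The stored data itself tells the two cases apart.
legacy_names = [unicodedata.name(ch) for ch in stored_legacy]
unicode_names = [unicodedata.name(ch) for ch in stored_unicode]
print(legacy_names)   # names starting with LATIN SMALL LETTER
print(unicode_names)  # names starting with DEVANAGARI LETTER
```

In the legacy case the file contains Latin letters and only the font makes them look like Marathi; in the Unicode case the file itself says the letters are Devanagari, whatever font is used.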

Keyboard

As we all know, a computer keyboard has a number of push-button switches that work like a doorbell switch. Every time a key is pressed, the computer receives the internal key number associated with that key. On a standard QWERTY keyboard, if the ‘Q’ key is pressed, the computer receives the information that key number 16 was pressed. At this point, the computer consults a “lookup table” to translate the key number into a character. If the current keyboard language is set to English, the corresponding lookup table is used, and the computer interprets key number 16 as character ‘Q’ and sends it to the program. However, if the current keyboard is set to, say, French, key 16 is interpreted as character ‘A’. Similarly, the German table swaps ‘Y’ and ‘Z’, so the key that gives ‘Y’ in English (key number 21) gives ‘Z’ in German. http://msdn.microsoft.com/en-us/library/aa299374(VS.60).aspx has details about the key numbers for all the keys. These lookup tables are called “keyboard layouts”.
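The lookup itself can be sketched in a few lines. The tables below are tiny excerpts that I filled in from the usual PC key numbering (key 16 is the ‘Q’ position and key 21 is the ‘Y’ position on an English keyboard); a real keyboard layout covers every key and also handles Shift and other modifiers. The names KEYBOARD_LAYOUTS and key_to_char are made up for this illustration.

```python
# Tiny excerpts of three keyboard layouts: key number -> character.
# Real layouts cover every key plus Shift, AltGr and so on.
KEYBOARD_LAYOUTS = {
    "English": {16: "q", 21: "y"},
    "French": {16: "a", 21: "y"},  # AZERTY: 'A' sits where English has 'Q'
    "German": {16: "q", 21: "z"},  # QWERTZ: 'Z' sits where English has 'Y'
}


def key_to_char(layout_name, key_number):
    """Translate a key number to a character using the chosen layout.
    Returns None for keys missing from these small excerpts."""
    return KEYBOARD_LAYOUTS[layout_name].get(key_number)


# The same physical key, pressed under three different layouts.
for layout_name in KEYBOARD_LAYOUTS:
    print(layout_name, key_to_char(layout_name, 16), key_to_char(layout_name, 21))
```

The keyboard hardware sends the same key number every time; changing the keyboard language only swaps the table used to interpret it.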