Lexical correctness of an electronic corpus of Macedonian texts extracted from the Internet

In recent times, the largest part of textual resources has been published on the Internet. From the lexical point of view, the quality of the texts available online varies widely. Content in forums and blogs is not controlled by anyone. It is often of very poor quality, sometimes because of the author's weak language skills, but predominantly because of the informal writing style and orthography that are used only there. Online newspapers, journals, and particularly books published online, however, are supposed to be checked for spelling and grammar. For that reason, online text corpora of this kind are well suited to the validation of lexical resources.

The work presented in the paper deals with the lexical correctness of manually edited Macedonian texts published on the Internet.

The paper first introduces lexical resources that were used to validate lexical correctness. They consist of a basic dictionary extracted from printed dictionaries. We extended this basic dictionary with newly derived verbal adjectives, as well as with the most common toponyms, proper nouns and compound words. The lexicon has further been extended with all the inflections of dictionary entries, comprising negative adjectives and adverbs, prefixed verbs, diminutives and augmentatives.

Afterward, the corpus is presented. It was compiled from available Macedonian texts published on the Internet in the UTF-8 encoding. Although limited to one encoding standard, the corpus consists of more than one million word forms, of which more than 80,000 are distinct. The corpus was subdivided into six smaller corpora: drama, essays, criticism, poetry, prose and laws, each of them contributing at least 10,000 word forms.

The analysis of the lexical correctness of the electronic corpus is in fact a measure of how many of the word forms in the corpus are covered by the electronic lexicon, i.e. an intersection of the lexicon with the corpus itself. The coverage ranged from 72.98% for legislative texts to 80.70% for prose. When the frequency of the words is taken into account, the coverage increases, reaching 89.78% for criticism and 94.28% for prose.
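The two coverage figures described above can be illustrated with a minimal sketch. It is not the procedure used in the paper: the tokenizer (maximal runs of letters, lowercased), the toy lexicon and the toy text are all assumptions made for illustration. The first figure counts each distinct word form once; the second weights each word form by its frequency in the corpus.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a word form is a maximal run of Unicode letters, lowercased.
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]


def coverage(tokens, lexicon):
    """Return (coverage over distinct forms, frequency-weighted coverage, unknown forms)."""
    counts = Counter(tokens)
    known = [w for w in counts if w in lexicon]
    # Share of distinct word forms that are found in the lexicon.
    type_cov = len(known) / len(counts)
    # Share of running words that are found in the lexicon.
    token_cov = sum(counts[w] for w in known) / sum(counts.values())
    unknown = sorted(w for w in counts if w not in lexicon)
    return type_cov, token_cov, unknown


# Hypothetical toy data, not taken from the corpus described in the paper.
lexicon = {"куќа", "е", "голема", "и"}
text = "Куќата е голема и куќа е стара и е нова"

type_cov, token_cov, unknown = coverage(tokenize(text), lexicon)
print(f"distinct word forms covered: {type_cov:.2%}")
print(f"running words covered:       {token_cov:.2%}")
print("unrecognized word forms:", unknown)
```

Because frequent word forms are more likely to be in a lexicon than rare ones, the frequency-weighted figure is normally the higher of the two, which matches the direction of the increase reported above.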

The corpus introduced around 18,000 unrecognized word forms. They were classified into six groups: word forms that contain Latin letters, incorrect word forms, interjections, “dialectisms”, proper nouns, and finally, newly derived word forms. Interestingly, spelling errors appeared in around 6,000 word forms, which is approximately equal to the number of new word forms created in the electronic lexicon. This supports the lexical correctness of the online corpus, and also confirms the correctness of the electronic lexicon.

Elena Petroska

About the so-called renarration in Macedonian and Bulgarian

The subject of interest is the so-called renarration category in Macedonian and Bulgarian. The goal is to determine which of the terms used in the Slavic and Balkan literature, and especially in the Macedonian and Bulgarian literature, gives the best idea of this complex category and its meaning, and to show how the two languages differ in the distribution of the forms they use to express these meanings.

It is well known that Macedonian and Bulgarian have a so-called renarration distinction traditionally described as based on the opposition witnessed/reported. This has been a topic for many scholars dealing with Macedonian, Bulgarian and Balkan linguistics. The theories and the terminology that have been used to better describe this distinction vary and include: the category of status, evidential category, marked for distance, marked as nonconfirmative, reported, admirative, dubitative etc.

Лидија Тантуровска

The direct and indirect object in the Macedonian standard language versus the same objects in some other (Slavic and non-Slavic) languages

This research focuses on the definitions of the direct and indirect object in the Macedonian standard language compared with their definitions in other languages.

When we consider the valence of a verb, it is in some cases not possible to form a sentence without one of the objects. Because these sentence elements are sometimes obligatory, direct and indirect objects are fundamental elements of sentence structure.

In the Macedonian standard language, the direct object is a noun phrase without a preposition, directly connected to the verb. It is directly affected by the verb's action.

The indirect object is a noun phrase indirectly connected to the verb. It functions as the addressee or goal of the verb's action.

One of the best-known characteristics of the Macedonian standard language is that both direct and indirect objects can be coupled with the corresponding short pronominal forms.

Sometimes the definitions of these objects are the same in other languages, and sometimes they differ. We have tried to compare them across several languages (Slavic and others).

Zuzanna Topolinjska

The role of pragmatic and/or semantic factors in the evolution of Slavic nominal systems

F. V. Mareš, in his opening contribution as a member of the Macedonian Academy of Sciences and Arts (cf. Das Verhältnis der Belebtheits- und Determinierungskategorie im Slavischen, MASA, Opening addresses, contributions and bibliography, 1982, 162-175), formulates an interesting thesis on the complementary distribution of the grammatical categories of definiteness and animacy in Slavic languages. He argues that in the Northern Slavic languages all the markers of referential definiteness present in Old Church Slavonic texts were successively eliminated and compensated for by the development of the new category of animacy, while in the Balkan Slavic languages the category of definiteness was developed and consolidated. In order to prove his thesis, Mareš carries out a detailed analysis of the grammatical means that function as exponents of the two complementary categories in particular Slavic languages. Finally, he looks for the cause of such a development and comes up with the answer that it could have been triggered by the “therapeutic” need to differentiate the nominative from the accusative in the old singular masculine paradigms.

In my paper I would like to present some pragmatic and semantic factors relevant to the development of the two categories in question. I focus on the anthropocentric principle and its role in the evolution of Slavic nominal systems.

Станислава-Сташа Тофоска

Verbal Predicates as Carriers of Evidential Information in Macedonian and Polish

Evidential meanings specify the source of knowledge expressed in assertions; more precisely, they specify the source of knowledge by virtue of which the speaker feels entitled to make a statement. Any grammatical or lexical expression conventionally containing such a meaning component ought to be regarded as a marker of evidentiality. Macedonian is one of the few languages in which the category of evidentiality is regularly grammaticalized (by the sum-perfect), while in Polish lexical markers of evidentiality are predominant.

This paper discusses some verbal predicates (verbs denoting speech acts and mental states) that carry lexical information about evidentiality in Macedonian and Polish. The aim of the paper is to point out these lexical expressions, to find out whether and how the grammatical and lexical markers of evidentiality combine in statements with these verbal predicates in Macedonian, and to compare this with the situation in Polish.


Я. И. Бьорнфлатен

Grammaticalization in the historical morphology of Russian: “На халава вянцы надели”

Grammaticalization, nominative/accusative plural площадя, галава

It is well known that in Russian, unlike in all the other Slavic languages, the noun ending -á in the nominative/accusative plural is very widespread. While in the literary language it covers a large number of masculine nouns, in colloquial speech and in the urban vernacular (prostorechie) it also covers feminine nouns of the type площадь - площадя, мать - матеря. In some dialects, above all South Russian ones, the ending -á is found even in nouns of the second declension, for example приехать в чужие земля, сунуть под голова.

While forms of the type площадя can be interpreted as the result of proportional analogy:

дом - домá

площадь - X, X = площадя

this is hardly possible for forms of the type голова - голова. In this case proportional analogy is not applicable:

дом - домá

головá - X, X? = головá

Against the background of the historical development of Russian declension and in the light of grammaticalization theory, an attempt is made to explain the nominative/accusative plural forms in -á.

