What the Language and Culture Atlas of Ashkenazic Jewry is:

Notes:
Slide 13 of 26

The range of language occurrences in a corpus should be proportional to the full range of occurrences in the language itself. This requirement, together with the problems it raises, is gaining importance as scientific attention turns to the quantitative evaluation of corpora. The difficulties were already discussed in the late 1960s, when limited computational power forced a reasonable economy of corpus size. In the late 1970s the concept of "representative corpora" was criticized (cf. B. Rieger: Repräsentativität. Von der Unangemessenheit eines Begriffs zur Kennzeichnung eines Problems linguistischer Korpusbildung, in: H. Bergenholz / B. Schäder: Empirische Textwissenschaft, Königstein 1979, pp. 52ff.), and the concept of the "balance of the corpus" was developed later (cf. e.g. J. Sinclair: Corpus, Concordance, Collocation, Oxford 1991, pp. 13ff.).
We follow our own approach. Every corpus has a property that we provisionally call its "degree of saturation". It is determined as follows: compute a statistic for an arbitrarily chosen language occurrence on the current corpus, add another text to the corpus, and repeat the calculation. The degree of saturation increases as the variation in the statistic decreases. Once the statistical results no longer change when the corpus is expanded, there is no need to expand it further.
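The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used for the atlas: the statistic (relative frequency of one word), the whitespace tokenization, and the names `relative_frequency`, `saturation_point`, `tolerance` and `window` are all hypothetical choices made for the example.

```python
from collections import Counter


def relative_frequency(tokens, target):
    """Share of `target` among all tokens (0.0 for an empty corpus)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[target] / len(tokens)


def saturation_point(texts, target, tolerance=0.01, window=3):
    """Add texts one at a time and recompute the statistic after each one.

    Returns (count, history). `count` is the number of texts after which the
    statistic changed by less than `tolerance` for `window` consecutive
    additions, or None if that never happened. `history` holds the value of
    the statistic after each added text.
    Assumption: "invariant" is approximated by a small absolute change.
    """
    corpus = []
    history = []
    stable_steps = 0
    for count, text in enumerate(texts, start=1):
        corpus.extend(text.lower().split())  # naive whitespace tokenization
        history.append(relative_frequency(corpus, target))
        if len(history) >= 2 and abs(history[-1] - history[-2]) < tolerance:
            stable_steps += 1
        else:
            stable_steps = 0
        if stable_steps >= window:
            return count, history
    return None, history
```

Calling `saturation_point(texts, "the")` on a list of text strings returns the point at which adding further texts stopped moving the chosen statistic, along with the full history for plotting. In practice the tolerance and window would have to be chosen per statistic.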
This property of corpora has also been described by Shannon's entropy theorem (cf. C. Shannon, W. Weaver: The Mathematical Theory of Communication, Urbana 1949. F. Bauer, G. Goos: Informatik, Eine einführende Übersicht, Berlin, New York, Heidelberg 1971) and can be applied widely. For example, it can replace the classical term "representative", which is hard to define and to work with. It also defines the minimum corpus size at which reasonable statistical results can be expected.
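One possible reading of the link to Shannon is to use the entropy of the corpus's word distribution as the statistic that is tracked while the corpus grows. The notes do not say which of Shannon's results is meant, so the sketch below is an assumption: it uses the standard entropy formula H = -Σ p log2 p over empirical word frequencies, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the empirical token distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_curve(texts):
    """Entropy of the cumulative corpus after each added text."""
    corpus = []
    curve = []
    for text in texts:
        corpus.extend(text.lower().split())  # naive whitespace tokenization
        curve.append(shannon_entropy(corpus))
    return curve
```

If the values returned by `entropy_curve` level off as texts are added, that would be one indication, under this reading, that the corpus has reached a usable minimum size for the statistic in question.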
On this basis we regard the "virtual corpus", as defined here, as a new method that can be of good use for the statistical analysis of language.
