Corpus-informed application based on Korean Learners’ Corpus: substitution errors of topic and nominative markers
Asian-Pacific Journal of Second and Foreign Language Education volume 6, Article number: 13 (2021)
This study aims to demonstrate the need for learner-corpus-informed applications and proposes methods of application that promote the proper use of Korean topic and nominative markers. This study extracted 3004 error items from the error-annotated corpus of the Korean Learners’ Corpus, the largest Korean learner corpus to date. These data were examined in detail to subdivide the types of substitution errors related to the topic and nominative markers, and to analyze the error rate according to the type of error and level of proficiency. The statistical data revealed no consistent correlation between the error rate and proficiency level. Furthermore, based on the proportion of error types by proficiency level, this study proposes the use of common mistake boxes with real errors; these errors are generally committed by learners of all proficiency levels and are not presumed problematic by grammarians or intuition-based teachers. These boxes can, therefore, be utilized as a practical tool for inclusion in pedagogical materials, such as learner’s dictionaries and textbooks.
The Korean language is an agglutinative language with markers attached to nouns to indicate the case. Therefore, understanding the use of case markers is critical to Korean language learners comprehending Korean sentence structures. According to statistics, based on the error-annotated corpus of texts by Korean language learners, which was released in 2020 by the National Institute of Korean Language, errors involving the use of nominative markers constituted the highest percentage of errors among those related to case markers. Furthermore, as the Korean language is a topic-oriented language (Li & Thompson, 1976), it is important that Korean language learners acquire the use of topic and nominative markers. However, it may be difficult for them to select and utilize the appropriate type of markers because, sometimes, a topic marker can be attached to the subject of a sentence, instead of a nominative one.
Regarding the above markers, the problem of substitution has been a long-term topic of interest in the field of Korean language education, and three types of analysis have been conducted in an effort to resolve the difficulties that language learners and teachers face as a result. First, a teaching method was presented using an intuition-based approach that was based on the analyses of the two markers that appeared in textbooks (Ahn, 2009; Kim & Nam, 2002). Second, individual researchers analyzed the substitution errors related to the two markers by using a small-scale collection of data that was built using questionnaires and texts written by Korean language learners (Jung, 2004; Park, 2010). Third, researchers used the Korean Learners’ Corpus (KLC) (2020), which had been in the process of being compiled, or a small-scale learner corpus they had built, to analyze the frequency of error patterns that occurred due to substitution, misformation, omission, and addition of grammatical markers (Jang, 2019; Kim, 2009). However, the first and second types of research used artificial data that were a result of highly controlled language tasks. Moreover, the issue of representativeness is a potential limitation, as the two types of studies utilized a small collection of data. Furthermore, the third type of research, the corpus-based studies, concluded after performing a quantitative analysis and did not conduct further research regarding the pedagogical applications of their findings.
Since the 1990s, it has been emphasized that the teaching of foreign languages should move beyond intuition-based teaching and research and instead be based on the analysis of errors committed by language learners, as demonstrated in qualified, large-scale learner corpora (Biber, Conrad, & Reppen, 1998; Chambers, 2015; Conrad, 2005; Flowerdew, 2012; Götz & Mukherjee, 2019; Granger, 1993, 2012, 2015; Meunier, 2002; Meunier & Reppen, 2015; Mindt, 1996; Nesselhauf, 2004). The field of learner corpus research is located in the intersection of corpus linguistics, second language learning, and foreign language teaching (Boulton, 2017; Le Bruyn & Paquot, 2021; Rankin, 2015; Vyatkina & Boulton, 2017). The findings of learner corpus analysis can provide teachers and language learners with a more effective form of language education by focusing on the grammatical forms and structures that learners find the most difficult. For example, Mindt (1996) examined the modal verbs, future time orientation, and conditional clauses that appeared in the native corpora of English and German textbooks that taught English as a foreign language. As a result, it was revealed that the grading of these grammatical items in textbooks was inconsistent with their use in corpus data. Accordingly, Mindt asserted that the use of corpus-based descriptions contributes to the effectiveness of foreign language education and the compilation of textbooks that include the use of actual English. However, “pedagogical ‘implications’ are much more numerous than ‘applications’ […] Learner corpus researchers should do more than point to some vague pedagogical implications” (Granger, 2015, p. 507). Therefore, a balance must be found between the frequency, difficulty, and pedagogical relevance of target grammatical items for the education of foreign languages. To strike a balance between these three elements, learner corpus research is a necessity (Meunier, 2002).
In May 2020, the National Institute of Korean Language introduced the KLC, a large-scale corpus that would have been difficult for individual researchers to build. The KLC is the most extensive Korean learner corpus to date, with 3,784,091 words in the raw corpus. In addition, the KLC includes an error-annotated corpus that provides a basis for observing the characteristics of errors committed by Korean language learners.
Based on the extensive KLC, this study aims to demonstrate the need for learner-corpus-informed applications and proposes methods of application that promote the proper use of Korean topic and nominative markers by addressing the following research questions:
Is there a correlation between the error rate and proficiency level regarding the use of Korean topic and nominative markers?
How does the learner corpus-based description differ from the existing intuition-based one?
How can the corpus-based empirical findings apply to pedagogical materials?
To investigate these questions, the substitution errors of topic and nominative markers are extracted from the KLC. In contrast to previous studies, the present study does not merely depend on error-annotated data or stop at demonstrating the frequency of substitution error patterns for topic and nominative markers. Instead, in the Results section, it analyzes the correlation between the error rate and the proficiency level of language learners, according to the (sub-)categorization of substitution errors of topic and nominative markers. In the Discussion section, based on the results of the statistical analysis, this paper proposes the use of common mistake boxes that could be applied to pedagogical materials.
Difficulty of correct usage of the topic and nominative markers
The following two sentences extracted from the KLC demonstrate the unavoidable difficulty that Korean language learners face in choosing between the topic marker -n(un) and the nominative marker -i/ka (Footnote 1). The underlined constituents of (1a) and (1b) are both subjects, and each shows a substitution error.
1a. substitution of a nominative marker for a topic marker
(sample number 1549)
“My friend likes k-pop”
1b. substitution of a topic marker for a nominative marker
(sample number 5592)
a few days ago
“A few days ago, a terrible accident happened”
Regarding (1a), the context prior to this sentence refers to the relationship between the “friend” and the speaker of this sentence. This sentence, too, is about the same friend. Accordingly, the “friend” naturally becomes the subject of the sentence in addition to being the topic. However, when a nominative marker is affixed to this constituent, the meaning becomes inconsistent with the implied context of the sentence. This is because a nominative marker basically indicates the focus of a sentence (Jun, 2015). In contrast to the sentence in (1a), the sentence in (1b) is the opening sentence of a conversation. Therefore, a topic marker cannot be affixed to the subject “terrible accident.” The use of a nominative marker, on the other hand, is appropriate in (1b), as it indicates the focus from the perspective of information structure.
In summary, substitution errors of the two markers are a common problem for language learners for two main reasons. First, a nominative marker reflects the grammatical function (subject) and the information structure (focus). Second, besides generally indicating the topic of a sentence, a topic marker can also be attached to a subject constituent in place of a nominative marker. Thus, to use these two markers accurately, learners need to understand the sentence structure as well as the context.
Description of the topic and nominative in pedagogical materials
In this section, we briefly examine how textbooks, pedagogical grammar reference books, and learner’s dictionaries explain the use of topic and nominative markers. Let us begin by examining four textbooks: Ewha Korean (2011), Sejong Korean (2019), Seogang Korean (2015), and Yonsei Korean (2013). These textbooks shared two similarities. First, while Korean language textbooks are generally categorized into six levels, from Level 1 (beginner) to Level 6 (advanced), according to the proficiency level of learners, the grammatical description of the two markers appeared only in Level 1. Second, the explanation of topic markers appeared earlier in the textbooks than that of nominative markers (Footnote 2). However, while some studies have shown that the occurrence of case-marking errors for topic markers decreases as language learners advance to the intermediate and advanced proficiency levels (Kim, 2009; Ko et al., 2004), other studies have determined that the rate of occurrence increases as learners progress from the intermediate to the advanced level (Kim & Nam, 2002; Lee, 2002). Similarly, while one study found that the error rate for the use of nominative markers decreases as language learners progress from the beginner to the advanced level (Ko, 2002), research has also shown that the substitution error rate is higher at the intermediate level than at the beginner level (Kim & Nam, 2002). These findings demonstrate that adequate explanations of the use of these two markers are needed in textbooks at every level, not only at Level 1.
Next, a review of two learner’s dictionaries, the Learner’s Dictionary of Korean (2008) and Korean Learners’ Dictionary (2020), and three grammar reference books, Baek (1999), National Institute of Korean Language (2005), and Lee and Lee (2006), revealed that an intuition-based approach was being used to arrange the diverse uses of topic and nominative markers. For example, in the Korean Learners’ Dictionary (2020), which was published by the National Institute of Korean Language, the uses of topic and nominative markers are described in the following way (Figs. 1 and 2).
In this dictionary, topic and nominative markers are divided into three senses and described accordingly. The division of these markers into senses, the ordering of these senses, and the relevant definitions included in this dictionary are consistent with those in the Standard Korean Dictionary (2020), which was published for native Korean speakers by the same publisher, the National Institute of Korean Language. That is to say, this method of describing the two markers does not significantly differ from that of traditional lexicographers or grammarians, who base their descriptions on intuition or knowledge. Pedagogical materials for language learners must closely analyze which areas learners actually have difficulty with and present them. It is crucial that these materials show which of the diverse uses of topic and nominative markers should be selected and taught to learners at appropriate levels, according to the importance of each use. Moreover, it would be greatly beneficial to learners if pedagogical materials compared related grammatical items, such as topic and nominative markers, and presented their differences.
To achieve this, researchers must move beyond the traditional approach, analyze high-quality learner corpus data, and actively use the findings for the development of reference and instructional materials. It is expected that a thorough analysis and utilization of learner corpus data would provide detailed help regarding which grammatical items to select, and how they should be described and presented (Granger, 2015; McEnery et al., 2019; Meunier, 2021).
Learner-corpus-informed approach and its pedagogical application
Learner corpus research began to emerge as a new field of research in the late 1980s. The research primarily focused on English as the target language, and the representative corpora include the International Corpus of Learner English and Cambridge Learner Corpus. Also, as illustrated by the diagram below, it is possible to use a learner corpus to create learner’s dictionaries, textbooks, and pedagogical grammars that reflect the errors committed by language learners (Fig. 3).
As the multilingual population continues to grow, the learner corpora of diverse languages are developing, alongside the English language learner corpora. In Korea, there has also been an increase in learner corpora since the 2000s. To date, besides the KLC, the following learner corpora have officially been published: the Yonsei Learner’s Corpora (2002) and Korean University Learner Corpus (2006), which contain about 500,000 words each. Many previous studies have used learner corpora such as these to examine the substitution errors involving topic and nominative markers (Han, 2016; Jang, 2019; Kim, 2009, among others). However, it was found that, while previous studies demonstrated the aspects of frequency and difficulty, they did not examine the aspect of pedagogical relevance, especially with regard to the applications.
Next, we examine existing grammar reference materials based on small-scale and native corpora. Lee and Lee (2006) is not different from dictionaries in general, apart from the fact that the former included examples from small-scale learner corpora. The Learner’s Dictionary of Korean (2008) is based on native corpora. While the purpose of this dictionary was to provide a list of widely used terms, the selection of vocabulary words was not based on the words that appeared in a corpus with the highest frequency; instead, it was based on words selected by Korean language education experts or commonly determined to be important by existing learner’s dictionaries of the Korean language. Simply put, although it is a corpus-informed dictionary, it does not contain sufficient information that could be obtained from a corpus.
An issue could be raised regarding the discrepancy between the contents of intuition-based grammars and the types of errors observed in learner corpora. For example, Tognini-Bonelli (2001) observed that almost 50% of the occurrences of any were not consistent with the relevant explanations in pedagogical grammars. Similarly, Biber et al. (1998) found that several English language textbooks did not describe the discourse function of that-clauses, which are placed in the subject position. Meanwhile, the results of corpus analysis revealed that that-clauses appear in the subject position under certain conditions. These examples show that it is necessary to develop pedagogical materials based on corpus-based information, instead of the traditional intuition-based approach.
The corpus: Korean Learners’ Corpus
The KLC is a large-scale learner corpus that was constructed as a government-led project funded by the Ministry of Culture, Sports and Tourism; the process took approximately 5 years, from 2015 to 2019 (Footnote 3). The KLC was created with 26,152 samples from 93 language groups and 142 countries, and it is composed of raw (3,784,091 words from 26,152 samples), morph-tagged (2,629,261 words from 18,521 samples), and error-annotated (793,374 words from 4903 samples) corpora that include samples of both written and spoken language (Footnote 4). The tests for spoken and written data were conducted according to the level of proficiency over a period of 60 weeks. The written data were obtained from written compositions, and the spoken data from presentations and interviews that lasted 5 to 10 minutes.
In the case of the error-annotated corpus, detailed annotation statistics are presented by proficiency level and language group, so that users can use the corpus as reference material according to their objectives. Error tags are categorized to indicate error forms, patterns, and levels. Among the groups of lexical and grammatical items, the error forms considered in this study are the topic and nominative markers. Error levels are divided into categories of pronunciation, form, syntax, and discourse, and the annotation of pronunciation errors is limited to spoken data. Error patterns compare erroneous and corrected items to describe errors of omission, addition, substitution, and misformation. Of the categories of error levels and patterns, this study focuses on those of form and substitution, respectively. Substitution errors occur when the meaning and function of grammatical markers are not sufficiently understood, and they are the most frequent type of grammatical marker error.
Data collection and analysis procedure
This study extracted a total of 3246 error items from the KLC’s error-annotated corpus, covering cases in which topic markers were substituted for nominative markers and vice versa. However, instead of using the extracted data as is, this study put the data through a process of data cleansing, because certain items were repeated or erroneously analyzed. Accordingly, 242 repeated or erroneously analyzed items were deleted (see Table 1). As a result, this study began its analysis with a total of 3004 error items.
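The cleansing step described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the `ErrorItem` record, its fields, the `misannotated` flag, and the rule that two items are duplicates when sample, erroneous form, and corrected form all match are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorItem:
    """Hypothetical record for one annotated substitution error."""
    sample_id: int
    erroneous: str
    corrected: str
    misannotated: bool = False  # assumed flag set during manual review


def cleanse(items):
    """Drop flagged items and repeats, keeping the first occurrence of each key."""
    seen = set()
    kept = []
    for item in items:
        # Assumed duplicate key: same sample, same erroneous and corrected forms.
        key = (item.sample_id, item.erroneous, item.corrected)
        if item.misannotated or key in seen:
            continue
        seen.add(key)
        kept.append(item)
    return kept


# Counts reported in the text: 3246 extracted, 242 removed.
EXTRACTED, REMOVED = 3246, 242
REMAINING = EXTRACTED - REMOVED

# Toy input with placeholder forms, not corpus data.
toy = [
    ErrorItem(1, "form-A", "form-B"),
    ErrorItem(1, "form-A", "form-B"),                     # repeated item
    ErrorItem(2, "form-C", "form-D", misannotated=True),  # erroneous analysis
    ErrorItem(3, "form-E", "form-F"),
]
cleaned = cleanse(toy)
```

On the toy input, two of the four items survive. The reported totals are consistent with each other: 3246 − 242 = 3004.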
Based on the 3004 error items that had undergone the data cleansing process, this study carried out a (sub-)categorization process for substitution errors; in other words, it conducted a qualitative analysis with a bottom-up approach (Footnote 5). This process of categorization is not based on the content of existing grammars and textbooks; instead, it is solely based on the classification of the aforementioned 3004 substitution errors. Therefore, it was possible to determine the types of substitution errors shown in Tables 4 and 5 by indicating the cause of the learner’s error for each error item, and then grouping the various causes of errors, without first presuming the error subtypes. It should be noted that the error-annotated corpus of the KLC presents the context of each error item, which makes it possible to determine its information structure.
Regarding the revised forms of topic markers, it was possible to observe the subtypes of two larger groups of substitution errors, the topic and contrast categories. An examination demonstrated that there was a higher frequency of errors belonging to the topic category than the contrast category.
The revised forms of nominative markers were grouped into three larger categories; type 3, in particular, was classified into four subtypes.
The next section presents a statistical analysis based on this qualitative categorization of errors.
This study examines two error rates to determine the relationship between learners’ level of proficiency (L1-L6) and the types of errors, and determines the error tendencies of learners. The first error rate pertains to the proportion of error types based on the learners’ level of proficiency, while the second error rate indicates the proportion of proficiency levels based on the type of error that was committed by learners. The results of the analysis are as follows.
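One way to read the two error rates is as two normalizations of the same table of error counts: by proficiency level, and by error type. The Python sketch below follows that reading. The function names and the counts are hypothetical, and the study may have computed its rates over different denominators.

```python
def proportions_by_level(counts):
    """For each level, the share of each error type among that level's errors."""
    result = {}
    for level, by_type in counts.items():
        total = sum(by_type.values())
        result[level] = {t: n / total for t, n in by_type.items()}
    return result


def proportions_by_type(counts):
    """For each error type, the share of each level among that type's errors."""
    totals = {}
    for by_type in counts.values():
        for t, n in by_type.items():
            totals[t] = totals.get(t, 0) + n
    return {
        t: {level: by_type.get(t, 0) / totals[t] for level, by_type in counts.items()}
        for t in totals
    }


# Hypothetical counts for illustration only, not figures from the KLC.
toy_counts = {
    "L1": {"topic": 30, "contrast": 10},
    "L2": {"topic": 20, "contrast": 40},
}
by_level = proportions_by_level(toy_counts)
by_type = proportions_by_type(toy_counts)
```

With these toy counts, `by_level["L1"]["topic"]` is 30/40 and `by_type["topic"]["L1"]` is 30/50. The same cell of the table therefore yields a different rate depending on which total it is divided by, which is why the two views are reported separately.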
Substitution error of the nominative marker for the topic marker
First, we examine cases in which learners incorrectly use nominative markers in place of topic markers. In Table 4, the numbers indicate the types of errors and the letters indicate the subtypes. Each level is described as “L number,” such as L1 for level 1.
Proportion of error types by proficiency level
Figure 4 revealed that the error rate of error type 1 (topic) was higher than that of error type 2 (contrast) at all levels of proficiency. For error type 1, the error rate was highest at L1 and lowest at L6; for error type 2, the reverse was true.
In Fig. 5, the error rate of 1C (subject as a topic) was the highest at all levels of proficiency. Figure 6 revealed that subtype 2B (subject as a contrast) was more difficult than subtypes 2A (constituent, other than subject, as a contrast) and 2C (subject in contrastive focus constructions) at most proficiency levels.
Proportion of proficiency levels by error type
In Fig. 7, the error rate of error type 1 (topic) was highest at L3 and decreased as it approached L6. The error rate of error type 2 (contrast) increased overall as the proficiency level rose. The correlation between the proficiency level and the error rate for the topic marker is not immediately apparent.
In Fig. 8, error subtype 1A (subject of an introductory statement) displayed a marked decrease in error rate with increasing levels of proficiency. In contrast to 1A, the graphs for 1B to 1D do not show a unidirectional increase or decrease across proficiency levels.
As described in Fig. 9, in the case of error type 2, 2A (constituent, other than subject, as a contrast) exhibited characteristics that were contrary to those of 1A. In other words, the error rate of 2A rose with increases in the level of proficiency. Meanwhile, subtype 2B (subject as a contrast) had the highest error rate at L4 and decreased from L5 onward. Subtype 2C (subject in contrastive focus constructions) displayed the lowest error rate at L5, with an increased rate at L6. Error subtypes 2B and 2C do not, therefore, show a consistent relationship between the proficiency level and the error rate.
Substitution error of the topic marker for the nominative marker
Next, we examine cases in which learners incorrectly use topic markers in place of nominative markers. In Table 5, the numbers indicate the types of errors and the letters indicate the subtypes.
Proportion of error types by proficiency level
Examination of the types of error by proficiency level revealed that the proportion of errors related to error type 1 (subject as a focus) was highest at L1, but the proportion of errors related to error type 3 was highest for the remaining proficiency levels, as illustrated in Fig. 10. Of the subtypes of error type 3, learners at almost every proficiency level had difficulty with 3B (subject in an adnominal embedded clause) and 3C (subject in an adverbial embedded clause) (Fig. 11).
Proportion of proficiency levels by error type
Let us observe Fig. 12. In the case of error types 1 and 2, the error rates alternately declined and rose (decline, rise, decline, rise, decline) across proficiency levels L1 to L6, while the error rate of error type 3 was shown to gradually increase with subsequent levels of proficiency. Figure 13 revealed that the subtypes of type 3 errors showed an increase from the intermediate to the advanced levels, except 3C, where there was a decline after the intermediate levels. Ultimately, the graphs depicting the proficiency levels by error type and vice versa demonstrate that increases in the proficiency level do not signify a decrease in the error rate of nominative markers. These findings imply that learners must continuously be educated with regard to the use of these case markers, up to the advanced levels of proficiency.
The graphs in Figs. 7, 8, 9, 12, and 13 illustrate that the rates of error do not decrease with increases in the level of proficiency. This study performed a test of proportions to determine whether there was a difference in error rates for particular types of error, according to proficiency levels; the resulting p-values are presented in Tables 6 and 7. Bolded values indicate the pairs of proficiency levels for which the null hypothesis was rejected at a significance level of 0.05.
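The text does not specify which test of proportions was used. As an illustrative sketch only, the Python code below implements one common choice, a two-sided pooled two-proportion z-test, for comparing the error rate of one error type between two proficiency levels. The function name, the counts, and the choice of denominators are assumptions, not details taken from the study.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test for H0: the two underlying proportions are equal.

    x1, x2 are counts of the error type of interest; n1, n2 are the totals
    they are drawn from (an assumed choice of denominator).
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: P(|Z| >= |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for two proficiency levels, not figures from the KLC.
z, p = two_proportion_z_test(50, 100, 30, 100)
reject_at_005 = p < 0.05
```

Running such a comparison for every pair of proficiency levels gives a grid of p-values, in which the pairs that fall below 0.05 would be the ones highlighted. Statistical packages often apply a continuity correction or use the equivalent chi-square form, so their p-values can differ slightly from this sketch.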
The results of the proportions test demonstrated that in the case of topic markers, when compared to that of error type 1, the error rate of error type 2 did not decrease when the proficiency level increased. For subtype 1A of error type 1, the relationship between the proficiency level and error rate for L3-L6 could not be defined conclusively. There was also no relationship between the error rate and proficiency level for subtypes 1B and 1D, and 2A and 2B; this was highlighted by the fact that the null hypothesis was not rejected for comparisons regarding all levels of proficiency.
With regard to nominative markers, in the case of error type 1, the error rate of L1 was higher than it was for other proficiency levels; in the case of error type 2, the error rate of L6 was lower when compared to that of L1, L3, and L5. Regarding the subtypes A and D of error type 3, the error rate could not be considered to be high for lower levels of proficiency. However, for subtype C of the same error type, the error rate of L6 could be considered to be low, as the null hypothesis was rejected when L6 was compared to other levels of proficiency.
Discussion: from implication to application
This study examined the tendencies of substitution errors of topic and nominative markers by utilizing the error-annotated corpus of KLC. Relevant data regarding the substitution errors of topic and nominative markers were selected and a process of categorization was performed with a total of 3004 error items. Based on the categorization process, this study examined the error rates by proficiency level and by error type before conducting a test of proportions. This section presents the implications and applications for the learning and teaching of the Korean language.
Pedagogical implications of the findings
There are two main pedagogical implications of this study. First, as emphasized by Biber et al. (1998), Mindt (1996), and Tognini-Bonelli (2001), it is possible to examine the differences between the existing intuition-based and learner corpus-based descriptions (Footnote 6). Topic markers can be affixed to other constituents, such as objects and adverbs, in addition to being attached to subjects. The error rate regarding this matter was found to be 17.8%. Although this error rate is lower than the error rate (56.7%) of topic markers that are affixed to subjects, it demonstrates that it is necessary to teach the combination of topic markers with diverse constituents. In the case of nominative markers, learners are taught that this marker is attached to the grammatical subject of a simple sentence. However, the analysis of corpus data demonstrated that nominative marker errors, in fact, occur most frequently within embedded clauses, and not within main clauses. In other words, the teaching and learning of the Korean language can be expected to become more effective when pedagogical materials include subdivided error types and rates.
Second, the data that was obtained by reviewing the significance of proportions based on error types and proficiency levels could provide more detailed guidelines and methods for teaching and learning a foreign language. The errors are subdivided into several subtypes, and teachers can utilize them by connecting them to the proficiency level of learners. In the case of topic markers, an examination of the proportion of proficiency levels by error type demonstrated that the error rate continued to rise from L1 to L6 for error type 2 (contrast). Meanwhile, in the case of the error type 1 (topic), it was shown that the error rate falls significantly as the proficiency level progresses from L1 to L6. In addition, it was determined that there was no correlation between the error rates and proficiency levels for subtypes 1B (constituent, other than subject, as a topic) and 1D (subject of a general factual statement) of error type 1, and the subtypes 2A (constituent, other than subject, as a contrast) and 2B (subject as a contrast) of error type 2. For error types in which a correlation could not be found, an emphasis must be placed on the necessity of continuous learning. In the case of nominative markers, the resulting data for L1 demonstrated that the subject of a sentence does not automatically take a nominative marker in comparison with introductory statements (i.e., subtype 1A of topic marker errors). Additionally, the same data showed that continuous systematic teaching and learning of nominative markers is necessary, regarding the diverse embedded clauses that are taught to learners according to their levels of proficiency, from levels L2 to L6. These findings could be utilized to create pedagogical materials, such as worksheets that contain content based on the target proficiency level and error subtype. Furthermore, the findings could be used in a teacher’s syllabus, to provide guidelines regarding the areas to pay attention to per level of proficiency.
Learner-corpus-informed application: description of common mistake boxes
While previous studies of Korean language teaching and learning concluded with proposing the pedagogical implications of their analysis, this study presents common mistake boxes that focus on the grammatical items that have had consistently high error rates for all proficiency levels, according to the analysis of errors. Since the late 1990s, English language learner’s dictionaries have utilized learner corpora and presented English language learners with the applied information. For example, information regarding common errors (Longman Dictionary of Common Errors, 1996), common mistakes (Cambridge Advanced Learner’s Dictionary, 2013), help boxes (Longman Essential Activator, 1997), and frequency bands (Collins COBUILD English Dictionary for Advanced Learners, 2001) has been used to present the grammatical, spelling, collocational, and other errors that learners must pay attention to. This study reviews the descriptions of errors from various learner’s dictionaries and proposes the use of common mistake boxes.
The proficiency level of Korean language learners is generally categorized into six levels, from L1 (beginner) to L6 (advanced), and this categorization is reflected in textbooks and language materials. It could be presumed that the error rates of nominative markers in embedded clauses are high for the intermediate and advanced levels because learners are taught various conjunctive endings for embedded clauses as their level of proficiency increases. In other words, the appearance of particular grammatical forms, and the increased usage rate of these forms, are affected by the timing of when they are taught to learners (Granger, 2015; Swan, 2005). This indicates that the appearance and increased use of grammatical forms may not signify the development of learners’ grammatical abilities. In this study, the data regarding the proportion of error types by proficiency level allowed us to determine the areas in which L1-L6 learners have the most difficulty. The common mistake boxes presented by this study are based on this finding, and thereby focus on the types of errors that are committed by learners of all proficiency levels. Moreover, this study does not limit its analysis to certain language groups. Accordingly, it was possible to determine which errors generally appeared among Korean language learners at all proficiency levels, without distinguishing the language groups of the learners.
We now describe the common mistake boxes that we propose. The title of each common mistake box reflects the most frequently occurring error type, which was selected after examining the errors committed by learners. This selection was possible because the subtypes of errors were categorized as part of this study’s data analysis procedure. The content of the common mistake boxes may appear similar to the content of existing intuition-based works. However, the examples used in the boxes are real errors, whereas the intuition-based approach tends to focus on the errors that grammarians presume to be problematic rather than on the errors that emerge from the statistical analysis of learner corpora. The errors extracted from learner corpora represent the actual difficulties that learners face, not the potential difficulties that they may face. The common mistake boxes are composed of the following content (Fig. 14).
Example of the common mistake boxes for the topic marker
In the case of topic markers, subtype 1C (subject as a topic) of error type 1 and subtype 2B (subject as a contrast) of error type 2 displayed high error rates at all levels of proficiency (see Figs. 5 and 6). Thus, as shown below, these subtypes provide highly useful examples of common mistakes. We first describe the common mistake box related to subtype 1C. Learners at all levels of proficiency fail to use the topic marker correctly when it is attached to the subject. This use of the topic marker can be connected to the “Common mistake [focus and subject]” box (see Fig. 15), which illustrates cases in which the topic marker cannot be used. Thus, the box was marked “Compare with” to help learners better understand the relevant errors.
In the case of error type 2, subtype 2B was observed to be the most common error (see Fig. 6). Explanations were also included for the characteristics of sentences (conjunctions, conjunctive endings) that displayed errors in the corpus, to help lower the error rate of learners (Fig. 16).
Example of the common mistake boxes for the nominative marker
Regarding nominative markers, error type 3 represents the type of errors that occur commonly (see Fig. 10). In other words, learners have more difficulty in using the nominative marker than the topic marker, within embedded clauses. Accordingly, the theme of nominative markers and embedded clauses was used to compose a common mistake box (Fig. 17).
Although error type 1 is not common among the three types of errors, the common mistake box below is additionally presented because L1 learners display a markedly higher error rate than learners at other levels (see Fig. 10). The use of nominative markers can be connected to the errors that appear in the “Common mistake [topic and subject]” box, in which the use of the nominative marker is not permitted (see Fig. 18).
Practical application of the common mistake boxes for the topic marker
Of the common mistake boxes presented above, this study examines the practical applications of those for topic markers. As mentioned in Granger (2015), constructing a learner’s dictionary based on a large-scale learner corpus demands considerable effort as well as time. The KLC was released in May 2020, which is why, instead of constructing a new dictionary with the common mistake boxes above, we propose applying them to existing learner’s dictionaries and pedagogical materials, particularly the Korean-English Learners’ Dictionary described in the Literature review section. As mentioned earlier, the content of this learner’s dictionary is not greatly different from that of the Standard Korean Dictionary, which was published for native Korean speakers. However, as depicted in Fig. 19, this learner’s dictionary could be useful material for teachers as well as learners when common mistake boxes are applied to it.
The common mistake boxes for topic markers could effectively be applied to the first and second sense of topic marker descriptions of the Korean-English Learners’ Dictionary. The first sense indicates a contrast; a common mistake box was presented regarding the theme of [contrast and topic marker]. The second sense indicates the topic of a sentence; a common mistake box was presented regarding the theme of [topic and subject]. This box included an explanation for errors regarding the use of subjects as a focus and as a topic, which constituted the largest number of errors committed by learners. The explanation could be used to compare the uses of topic and nominative markers.
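As a rough sketch of how the mapping just described might be represented in a web-based dictionary, the following code links each sense of the topic marker entry to a common mistake box. The class name, field names, and helper function are hypothetical and are not part of the Korean-English Learners’ Dictionary; the example list is left empty because real learner errors would need to be drawn from the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CommonMistakeBox:
    """Hypothetical container for one common mistake box."""
    theme: str                          # box theme, e.g. "topic and subject"
    error_subtype: str                  # error subtype the box addresses
    compare_with: Optional[str] = None  # theme of a related box, if any
    examples: List[str] = field(default_factory=list)  # real learner errors


# Assumed mapping from sense numbers of the topic marker entry to boxes,
# following the description in the text (sense 1: contrast, sense 2: topic).
TOPIC_MARKER_BOXES: Dict[int, CommonMistakeBox] = {
    1: CommonMistakeBox(theme="contrast and topic marker", error_subtype="2B"),
    2: CommonMistakeBox(
        theme="topic and subject",
        error_subtype="1C",
        compare_with="focus and subject",
    ),
}


def icon_label(sense: int) -> str:
    """Return the label an icon for the given sense could display."""
    box = TOPIC_MARKER_BOXES[sense]
    label = f"Common mistake [{box.theme}]"
    if box.compare_with:
        label += f" (Compare with [{box.compare_with}])"
    return label


if __name__ == "__main__":
    for sense in sorted(TOPIC_MARKER_BOXES):
        print(sense, icon_label(sense))
```

Keeping the “Compare with” link as a field on the box, rather than as free text, would let a dictionary interface render the cross-reference between topic and nominative marker boxes consistently.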
As the Korean-English Learners’ Dictionary in Fig. 19 is a web-based dictionary, icons for the common mistake boxes could be created, and the boxes could be displayed in pop-up windows with the click of a mouse.
Based on the extensive KLC, this study observed the substitution errors of topic and nominative markers that Korean language learners find difficult. It then examined the relationship between the types of errors and the proficiency levels of learners by utilizing the frequency of errors, considering a balance between the frequency, difficulty, and pedagogical relevance of grammatical items. In the process of analyzing the errors committed by learners, this study was able to extract subtypes of each case marker’s uses that were more finely subdivided than those found in intuition-based materials. Furthermore, it determined that the relationship between proficiency level and type of error did not follow a consistent tendency. Based on these findings, this study presented common mistake boxes for the types of errors that exhibit high error rates at all levels; these boxes were then applied to pedagogical materials. This process demonstrated the necessity of moving beyond the intuition-based approach, in which grammarians or teachers present errors that they predict will occur, and underscored the significance of employing learner-corpus-informed applications. In particular, as the corpus utilized in this study is a large-scale corpus, the findings of this study are more useful and reliable than the results of studies that relied on the intuition-based approach or on small-scale corpora (Granger, 2015).
With the release of the KLC by the National Institute of Korean Language, the necessity of applying errors in language teaching and research has been emphasized. Just as the present study used the substitution errors of topic and nominative markers to present relevant common mistake boxes, it will be helpful to create common mistake boxes for other grammatical items, such as verbal endings. Another line of research worth pursuing further is observing error patterns from an interlanguage perspective. The error-annotated corpus of the KLC describes the error forms of 93 language groups of Korean language learners. Errors committed by learners of Korean from the Japanese, Chinese, and English language groups have been explored in depth; further research using extensive error data will expand the understanding of the error patterns of particular grammatical items or structures in light of the mother-tongue background of Korean language learners.
Availability of data and materials
The datasets analyzed in the current study are available from the corresponding author upon reasonable request.
These markers have variants: a word that ends with a vowel takes -nun (TOP) or -ka (NOM), and a word that ends with a consonant takes -un (TOP) or -i (NOM).
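The variant rule in this note can be sketched in code. The example below assumes the input is written in precomposed Hangul syllables and relies on the Unicode layout of the Hangul Syllables block (U+AC00 to U+D7A3), in which each syllable's final-consonant index is its offset modulo 28 and index 0 means no final consonant. The function names are hypothetical; the Hangul forms 는, 가, 은, and 이 correspond to the romanized -nun, -ka, -un, and -i above.

```python
HANGUL_FIRST = 0xAC00  # first precomposed Hangul syllable (가)
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable (힣)

# (form after a consonant, form after a vowel) for each marker
MARKER_FORMS = {
    "topic": ("은", "는"),       # -un / -nun
    "nominative": ("이", "가"),  # -i / -ka
}


def ends_with_consonant(word: str) -> bool:
    """Return True if the last Hangul syllable has a final consonant."""
    code = ord(word[-1])
    if not HANGUL_FIRST <= code <= HANGUL_LAST:
        raise ValueError("word must end in a precomposed Hangul syllable")
    # The final-consonant index is the offset modulo 28; 0 means none.
    return (code - HANGUL_FIRST) % 28 != 0


def attach_marker(word: str, kind: str) -> str:
    """Attach the topic or nominative marker variant that fits the word."""
    after_consonant, after_vowel = MARKER_FORMS[kind]
    if ends_with_consonant(word):
        return word + after_consonant
    return word + after_vowel


if __name__ == "__main__":
    print(attach_marker("학생", "topic"))       # consonant-final noun
    print(attach_marker("친구", "nominative"))  # vowel-final noun
```

This sketch covers only the choice between variants of the same marker; it does not address the choice between topic and nominative markers, which is the source of the substitution errors analyzed in this study.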
This seems to be because, in Korean, a speaker generally uses a topic marker instead of a nominative marker to introduce information such as their name and nationality.
In 2002, a learner corpus of 500,000 words was constructed in a project that was led by the Ministry of Culture, Sports and Tourism, but the corpus could not actually be distributed and used, due to copyright issues. For the use of the KLC, data was collected after obtaining the consent of learners according to IRB regulations with regard to the provision and utilization of data. The government-led KLC is a source of public data; thus, it is considered to be a balanced learner corpus of the Korean language that can be freely distributed and widely used.
Although the KLC is small compared to the Cambridge Learner Corpus, which contains up to 50 million words, it is a comparatively large-scale corpus when compared to other Korean learner corpora.
The KLC’s error-annotated corpus includes information regarding the language group of learners in addition to their proficiency level. Therefore, the data analysis procedure also included the statistical analysis of language groups (Tables 2 and 3). The language groups that displayed the highest percentages of topic and nominative marker substitution errors were Chinese, English, and Vietnamese.
As this study focuses on the relationship between the types of errors and the proficiency level of learners, error rate data regarding the language groups are introduced as reference items.
For existing printed dictionaries, textbooks, or grammars, common mistake boxes could be utilized in a supplementary appendix.
KLC: Korean Learners’ Corpus
Ahn, Y. (2009). The meaning and the functions of ‘-i/ga’ and ‘-eun/neun’ in Korean textbooks for foreigners and their effective teaching methods [Master’s thesis]. Graduate School of Education. Hanyang University.
Baek, B. (1999). Korean grammar dictionary. Yonsei University Press.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: investigating language structure and use. Cambridge University Press. https://doi.org/10.1017/CBO9780511804489.
Boulton, A. (2017). Research timeline: Corpora in language teaching and learning. Language Teaching, 50(4), 483–506. https://doi.org/10.1017/S0261444817000167.
Cambridge Advanced Learner’s Dictionary. (2013). Cambridge University Press.
Chambers, A. (2015). The learner corpus as a pedagogic corpus. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research, (pp. 445–464). Cambridge University Press. https://doi.org/10.1017/CBO9781139649414.020.
Collins COBUILD English Dictionary for Advanced Learners. (2001). HarperCollins Publishers.
Conrad, S. (2005). Corpus linguistics and L2 teaching. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning, (pp. 417–434). Lawrence Erlbaum Associates.
Ewha Korean (2011). Level 1–6, Ewha Korean Language Center. Ewha Womans University.
Flowerdew, L. (2012). Corpora and language education. Palgrave Macmillan. https://doi.org/10.1057/9780230355569.
Götz, S., & Mukherjee, J. (2019). Learner corpora and language teaching. John Benjamins Publishing Company. https://doi.org/10.1075/scl.92.
Granger, S. (1993). The international corpus of learner English. In J. Aarts, P. de Haan, & N. Oostdijk (Eds.), English language corpora: design, analysis and exploitation, (pp. 57–69). Rodopi.
Granger, S. (2012). Error-tagged learner corpora and CALL: a promising synergy. CALICO, 20(3), 465–480.
Granger, S. (2015). The contribution of learner corpora to reference and instructional materials design. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research, (pp. 485–510). Cambridge University Press. https://doi.org/10.1017/CBO9781139649414.022.
Han, S. (2016). A study on the errors and use of the particle ‘un/nun’ by Korean learners: Focusing on their Korean proficiency and their native languages. Korean Language and Literature, 70, 111–151. https://doi.org/10.22784/eomun.2016.70.111.
Jang, S. (2019). A quantitative study on the particle errors of English-speaking Korean learners. Using <Korean Learners’ Corpus> by National Institute of Korean Language. Language Facts and Perspectives, 47, 25–57. https://doi.org/10.20988/lfp.2019.47.25.
Johansson, S. (2009). Some thoughts on corpora and second-language acquisition. In K. Aijmer (Ed.), Corpora and language teaching, (pp. 33–44). John Benjamins. https://doi.org/10.1075/scl.33.05joh.
Jun, Y. (2015). Focus, topic, and contrast. In L. Brown, & J. Yeon (Eds.), The handbook of Korean linguistics, (pp. 179–195). Wiley-Blackwell. https://doi.org/10.1002/9781118371008.ch10.
Jung, E. (2004). Topic and subject prominence in interlanguage development. Language Learning, 54(4), 713–738. https://doi.org/10.1111/j.1467-9922.2004.00284.x.
Kim, C., & Nam, K. (2002). An analysis of error in using Korean particles by native speakers of English. Korean Language Education, 13(1), 27–45. http://www.kci.go.kr/kciportal/landing/article.kci?arti_id=ART000868726.
Kim, J. (2009). Replacement error analysis of Korean learner in using the particles “i/ga” and “eun/neun”. Journal of Linguistic Science, 48, 1–40. http://www.kci.go.kr/kciportal/landing/article.kci?arti_id=ART001333971.
Ko, S. (2002). Error analysis of postposition in learner’s corpus. Teaching Korean as a Foreign Language, 27, 543–570. https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART001348494.
Ko, S., Kim, M., Kim, J., Seo, S., Jeong, H., & Han, S. (2004). Korean learner’s corpus and error analysis. Seoul: Hankuk munwhasa.
Korea University Learner Corpus (2006). Korean language center. Korea University.
Korean Learners’ Corpus. (2020). National Institute of Korean Language. https://kcorpus.korean.go.kr/service/goSummaryStatus.do Accessed 22 Feb 2021.
Korean Learners’ Dictionary. (2020). National Institute of Korean Language. https://krdict.korean.go.kr/mainAction Accessed 22 Feb 2021.
Le Bruyn, B., & Paquot, M. (2021). Learner corpus research meets second language acquisition. Cambridge University Press.
Learner’s Dictionary of Korean (2008). National Institute of Korean Language. Sinwonprime.
Lee, H., & Lee, J. (2006). Dictionary of postpositional particles for learners. Hangukmunhwasa.
Lee, J. (2002). A research study on language production errors of Korean language learners [PhD Thesis]. Kyung Hee University.
Li, C., & Thompson, S. A. (1976). Subject and topic: a new typology of language. In C. Li (Ed.), Subject and topic, (pp. 457–489). Academic Press.
Longman Dictionary of Common Errors (1996). Pearson Education.
Longman Essential Activator. (1997). Pearson ESL.
McEnery, T., Brezina, V., Gablasova, D., & Banerjee, J. (2019). Corpus linguistics, learner corpora, and SLA: Employing technology to analyze language use. Annual Review of Applied Linguistics, 39, 74–92. https://doi.org/10.1017/S0267190519000096.
Meunier, F. (2002). The pedagogical value of native and learner corpora in EFL grammar teaching. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition and foreign language teaching, (pp. 119–141). Benjamins.
Meunier, F. (2021). Introduction to learner corpus research. In N. Tracy-Ventura, & M. Paquot (Eds.), The Routledge handbook of second language acquisition and corpora, (pp. 23–36). Routledge.
Meunier, F., & Reppen, R. (2015). Corpus- versus non-corpus-informed pedagogical materials: Focus on grammar. In D. Biber, & R. Reppen (Eds.), The Cambridge handbook of English corpus linguistics, (pp. 498–514). Cambridge University Press. https://doi.org/10.1017/CBO9781139764377.028.
Mindt, D. (1996). English corpus linguistics and the foreign language teaching syllabus. In J. Thomas, & M. Short (Eds.), Using corpora for language research, (pp. 232–247). Longman.
National Institute of Korean Language (2005). Korean grammar for foreign learners. Communication Books.
Nesselhauf, N. (2004). Learner corpora and their potential for language teaching. In J. Sinclair (Ed.), How to use corpora in language teaching, (pp. 125–156). John Benjamins. https://doi.org/10.1075/scl.12.11nes.
Park, H. (2010). The L2 acquisition of the Korean double nominative construction by Chinese speakers. Korean Journal of Linguistics, 35(3), 635–658. https://doi.org/10.18855/lisoko.2010.35.3.006.
Rankin, T. (2015). Learner corpora and grammar. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research, (pp. 231–254). Cambridge University Press. https://doi.org/10.1017/CBO9781139649414.011.
Sejong Korean (2019). Level 1–8, King Sejong Institute Foundation. National Institute of Korean Language.
Seogang Korean (2015). Level 1–6, Korean Language Education Center. Seogang University.
Standard Korean Dictionary. (2020). National Institute of Korean Language. https://stdict.korean.go.kr/main/main.do Accessed 22 Feb 2021.
Swan, M. (2005). Practical English usage. Oxford University Press.
Tognini-Bonelli, E. (2001). Corpus linguistics at work. John Benjamins. https://doi.org/10.1075/scl.6.
Vyatkina, N., & Boulton, A. (2017). Corpora in language learning and teaching. Language Learning & Technology, 21(3), 1–8. https://scholarspace.manoa.hawaii.edu/bitstream/10125/44750/1/21_03_commentary.pdf.
Yonsei Korean (2013). Level 1–6, Korean Language Institute. Yonsei University.
Yonsei Learner’s Corpus (2002). Institute of Korean studies. Yonsei University.
This research was supported by Sookmyung Women’s University Research Grants (1–1903-2012).
The authors declare that they have no competing interests.
Cite this article
Chun, J., Kim, M.H. Corpus-informed application based on Korean Learners’ Corpus: substitution errors of topic and nominative markers. Asian. J. Second. Foreign. Lang. Educ. 6, 13 (2021). https://doi.org/10.1186/s40862-021-00112-7