Convert CommonMark links to headings with spaces to GitHub Flavored Markdown


My question was very badly written, but the new title reflects the actual question. Thanks to three very friendly and dedicated users (@harsh3466 @tuna @learnbyexample), I was able to find a solution for my files, so thank you guys!!!

For those who come across this post by chance, here are three possible ways to achieve the desired result.

Solution 1 (

#! /bin/bash

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

while IFS= read -r line; do
    #Converts 1.2 to 1-2 (a third-level heading such as 1.2.3 needs an additional [0-9] group)
    dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
    sed -i "s/$line/${dashlink}/" "$files"

    #Puts everything to lowercase after a hashtag
    lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
    sed -i "s/$dashlink/${lowercaselink}/" "$files"

    #Removes spaces (%20) from markdown links after a hashtag
    spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
    sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"
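
To see what the three sed steps in Solution 1 do without editing a file, here is a minimal sketch that applies the same expressions to a single string. The function name convert_anchor and the sample link are made up for illustration, and GNU sed is assumed because of \L.

```shell
# Hypothetical helper (not part of the original script): applies the three
# substitutions from Solution 1 to one string instead of editing a file.
convert_anchor() {
    printf '%s\n' "$1" | sed -r \
        -e 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|' \
        -e 's|#.+\)|\L&|' \
        -e 's|%20|-|g'
}

# Sample input in the shape that grep -Po '#.*' returns (assumed example).
convert_anchor '#1.2%20Some%20Heading)'
# prints: #1-2-some-heading)
```

The script itself then feeds each intermediate result back into sed -i, so the file is rewritten three times per link.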


Solution 2 (

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'


Solution 3 (

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'


Relevant links

  • I've got a sed regex that should work, just writing up a breakdown of the whole command so anyone interested can follow what it does. Will post in a bit.

    • This would be awesome ! A breakdown of the whole command will give me a better understanding !

      Thank you in advance, waiting for your post :)

      • Okay, here's the command and a breakdown. I broke down every part of the command, not because I think you are dumb, but because reading these can be complicated and confusing. Additionally, detailed breakdowns like these have helped me in the past.

        The command:

        sed -ri 's|]\(#.+\)|\L&|; s|%20|-|g' /path/to/somefile

        The breakdown:

        sed - calls sed

        -r - allows for the use of extended regular expressions

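        -i - edit the file given as an argument at the end of the command in place (note: when the two flags are combined, write -ri rather than -ir; GNU sed treats whatever directly follows i as a backup-file suffix, so with -ir the extended regular expressions would not be enabled)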

        Now the regex piece by piece. This command has two substitution expressions, which break the goal down into manageable chunks.

        Expression one is to convert the markdown links to lowercase. That expression is:

        s|]\(#.+\)|\L&|

        The goal of this expression is to find markdown links, and to ignore https links. In your post you indicate the markdown links all start with a # symbol, so we don't have to explicitly ignore the https as much as we just have to match all links starting with #. Here's the breakdown:

        ' - begins the entire expression set. If you had to match the ' character in your expression you would begin the expression set with " instead of '.

        s| - invoking find and replace (substitution). Note, I'm using the | as a separator instead of the / for easier readability. In sed, you can use just about any separator you want in your syntax.

        ]\(# - This is how we find the link we want to work on. In markdown, every link URL is preceded by ]( to indicate the closing of the link text and the opening of the actual url. In the expression, the ( is preceded by a \ because it is a special regex character. So \( tells sed to find an actual opening parenthesis character. Finally, the # will be the first character of the markdown links we want to convert to lowercase, as indicated by your example. The inclusion of the # ensures no https links will be caught up in the processing.

        .+ - this bit has two parts, . and +. These are two special regex characters. The . tells sed to find any character at all and the + tells it to find the preceding character one or more times. In the case of .+, it's telling sed to find one or more of any characters. You might think this will eat ALL of the text in the document and make it all lowercase, but it will not because of the next part of the regex.

        \) - this tells sed to find a closing parenthesis. Like the opening parenthesis, it is a special regex character and needs to be escaped with the backslash to tell sed to find an actual closing parenthesis character. This is what stops the command from converting the entire document to lowercase, because when you combine the previous bit with this bit like so .+\), you're telling sed to find one or more of any character up to a closing parenthesis. Note that sed works line by line and .+ is greedy, so the match runs to the last closing parenthesis on the line, not the first.

        | - This tells sed we're done looking for text to match. The next bits are about how to modify/replace that text

        \L - This tells sed to convert the given text to all lowercase

        & - This is the given text to modify. In this case the & is a special metacharacter that tells sed to modify the entire pattern matched in the matching portion of the expression. So when the & is preceded by the \L, this tells sed: take everything that was matched in the pattern matching expression and convert it to lowercase.

        ; - this tells sed that this is the end of the first expression, and that more are coming.

        So all together, what this first expression does is: Find a closing bracket followed by an opening parenthesis followed by a pound/hash symbol followed by one or more of any characters until finding a closing parenthesis. Then convert that entire chunk of text to lowercase. Because symbols don't have case, you can just convert the entire matched pattern to lowercase. If there were specific parts that had to be kept case sensitive, then you'd have to match and modify more precisely.
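
        One thing worth checking on your own files: .+ is greedy, so on a line that holds more than one link, the match runs to the last ) on that line. Here is a small sketch comparing the expression above with a variant that uses [^)]+ instead. The variant, the variable names and the sample line are assumptions for illustration, not part of the original command.

        ```shell
        # Made-up sample line with two anchor links and plain text in between.
        line='[A](#One%20Two) Keep [B](#Three%20Four) Keep'

        # Expression from the command above: .+ runs to the last ) on the line.
        greedy="$(printf '%s\n' "$line" | sed -r 's|]\(#.+\)|\L&|')"

        # Assumed variant: [^)]+ stops at the first ), and the g flag
        # handles every link on the line separately.
        per_link="$(printf '%s\n' "$line" | sed -r 's|]\(#[^)]+\)|\L&|g')"

        printf '%s\n' "$greedy" "$per_link"
        # prints:
        # [A](#one%20two) keep [b](#three%20four) Keep
        # [A](#one%20two) Keep [B](#three%20four) Keep
        ```

        If every link sits on its own line, both forms give the same result.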

        The next expression is pretty easy, UNLESS any of your https links also include the string %20:

        If no https links contain the %20 string, then this will do the trick:

        s|%20|-|g

        s| - again opens the expression, telling sed we're looking to substitute/modify text

        %20 - tells sed to find exactly the character sequence %20

        | - ends the pattern matching portion of the expression

        - - tells sed to replace the matched pattern with the exact character -

        | - tells sed that's the end of the modification instructions

        g - tells sed to do this globally. In other words, to replace every occurrence of the string %20 on each line with the string -, not just the first one on the line

        ' - tells sed that is the end of the expression(s) to be evaluated.

        So all together, what this expression does is: Within the given document, find every occurrence of a percent symbol followed by the number two followed by the number zero and replace them with the dash character.

        /path/to/somefile - tells sed what file to work on.

        Part of using regex is understanding the contents of your own text, and with the information and examples given, this should work. However, if the markdown links have different formatting patterns, or as mentioned any of the https links have the %20 string in them, or other text in the document might falsely match, then you'd have to provide more information to get a more nuanced regex to match.
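
        If some https links do contain %20, one option is to replace %20 only inside link targets that start with #. A minimal sketch, assuming GNU sed and that the targets contain no ) character; the function name dash_anchor_spaces, the label a and the sample line are arbitrary.

        ```shell
        # Hypothetical helper: replaces %20 only inside ](#...) targets.
        # Each pass replaces one %20; "ta" jumps back to label "a" until
        # no substitution happens any more.
        dash_anchor_spaces() {
            sed -r -e ':a' -e 's|(\]\(#[^)]*)%20|\1-|' -e 'ta'
        }

        printf '%s\n' '[A](#one%20two%20three) [site](https://example.com/a%20b)' | dash_anchor_spaces
        # prints: [A](#one-two-three) [site](https://example.com/a%20b)
        ```

        The loop is needed because a single s command cannot replace several %20 strings inside the same match while leaving the ones outside untouched.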

        Edit: clarified the use of the & metacharacter.

        Edit 2: clarified that the + metacharacter indicates finding the preceding character (or character set) one or more times.

  • Here's a solution with perl (assuming you don't want to change http/https after the start of ( instead of start of a line):

            perl -pe 's/\[[^]]+\]\(\K(?!https?)[^)]+(?=\))/lc $&=~s|%20|-|gr/ge' ip.txt
    • e flag allows you to use Perl code in the substitution portion.
    • \[[^]]+\]\(\K match square brackets and use \K to mark the start of matching portion (text before that won't be part of $&)
    • (?!https?) don't match if http or https is found
    • [^)]+(?=\)) match non ) characters and assert that ) is present after those characters
    • $&=~s|%20|-|gr change %20 to - for the matching portion found, the r flag is used to return the modified string instead of change $& itself
    • lc is a function to change text to lowercase
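
    A small test run of the one-liner above, as a sketch only: the wrapper name lower_targets and the sample line are invented, and the input is read from stdin instead of ip.txt.

    ```shell
    # Hypothetical wrapper around the one-liner above.
    lower_targets() {
        perl -pe 's/\[[^]]+\]\(\K(?!https?)[^)]+(?=\))/lc $&=~s|%20|-|gr/ge'
    }

    printf '%s\n' '[Intro](Notes.md#My%20Heading) and [Web](https://Example.com/A%20B)' | lower_targets
    # prints: [Intro](notes.md#my-heading) and [Web](https://Example.com/A%20B)
    ```

    Because \K comes right after the opening (, the whole target is lowercased, file name included.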
  • Not home so I can't try it but do you need to be so specific to match the whole markdown syntax?

    You might be able to get away with

    s/#(\w+%20)*\w+\.\w{2,3}/\L&/g; /#(\w+%20)*\w+\.\w{2,3}/ s/%20/-/g

    basically, matching as opposed to matching whole markdown link

    lowercasing that entire match, then on a search matching stuff that looks like that, replace the %20 with a hyphen (combined into a single sed command). this only fails when an http link falls within the same line as a markdown hyperlink

